International Federation of Digital Seismograph Networks

Thread: Next generation miniSEED - 2016-3-30 straw man change proposal 12 - Reduce record length field from 4 bytes to 2 bytes

Started: 2016-08-11 19:58:23
Last activity: 2016-08-25 01:48:01

Hi all,

Change proposal #12 to the 2016-3-30 straw man (iteration 1) is attached: Reduce record length field from 4 bytes to 2 bytes.

Please use this thread to provide your feedback on this proposal by Wednesday August 24th.

thanks,
Chad




  • Hi

    I think there should be a separation from what a datacenter permits in
    its ingestion systems and what is allowed in the file format. I have
    no problem with a datacenter saying "we only take records less than X
    bytes" and it probably also makes sense for datacenters to give out
    only small sized records. However, there is an advantage for client
    software to be able to save a single continuous timespan of data as a
    single array of floats, and 65k is kind of small for that. I know
    there is an argument that miniseed is not for post processing, but
    that seems to me to be a poor reason as it can handle it and it is
    really nice to be able to save without switching file formats just
    because you have done some processing. And for the most part,
    processing means to take records that are continuous and turn them
    into a single big float array, do something, and then save the array
    out. Having to undo that combining process just to be able to save in
    the file format is not ideal. And keep in mind that if some of the
    other changes, like network code length, happen, the existing post
    processing file formats like SAC will no longer be capable of holding
    new data.

    And in this case, the save would likely not compress the data, nor
    would it need to do the CRC. I would also observe that the current
    miniseed allows records of up to 2 to the 255th power bytes (the
    record length is stored as a power-of-two exponent in a single byte),
    and datacenters have not been swamped by huge records.

    It is true that big records are bad in certain cases, but that doesn't
    mean that they are bad in all cases. I feel the file format should not
    be designed to prevent those other uses. The extra 2 bytes of storage
    to allow up to 4 GB records seems well worth it to me.
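
    To put rough numbers on that (my arithmetic, not part of the proposal):
    a 2-byte length field caps a record at 65,535 bytes, which for
    uncompressed 4-byte floats is only about 16,000 samples, under three
    minutes at 100 sps. A quick sketch in C, assuming a nominal 64-byte
    header:

        /* Illustrative only: record capacity under 2-byte vs 4-byte
           length fields, assuming uncompressed 4-byte float samples
           and an assumed 64-byte header (not from the straw man). */
        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            uint32_t hdr    = 64;                     /* assumed header size */
            uint32_t samp16 = (UINT16_MAX - hdr) / 4; /* ~16k samples */
            double   samp32 = (UINT32_MAX - (double)hdr) / 4.0;
            printf("2-byte field: %u samples (%.0f s at 100 sps)\n",
                   samp16, samp16 / 100.0);
            printf("4-byte field: ~%.0f million samples\n", samp32 / 1e6);
            return 0;
        }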

    thanks
    Philip


    • Hi,

      My two cents is that the permitted length should be kept fairly small, so 65k should be fine. I do not know how many times I have dealt with formats like SAC, which can store a large time series segment with only a single timestamp for the first sample, where the time of the last sample ends up inaccurate because the digitizing rate is either not constant or is “slightly off”. Smaller record sizes force more frequent recording of timestamps and improve timing quality.
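
      To illustrate that drift with made-up numbers (a hypothetical example,
      not from any real station): if a stream declared as 100 sps actually
      digitizes at 100.001 sps, the predicted time of sample n is off by
      roughly n * 1e-7 seconds, about a full second after ten million samples:

          /* Hypothetical timing drift: predicted vs. actual time of
             sample n when the true rate is slightly off nominal. */
          #include <stdio.h>

          int main(void) {
              double nominal = 100.0;     /* declared rate (sps) */
              double actual  = 100.001;   /* true digitizer rate (sps) */
              long   n       = 10000000;  /* samples since last timestamp */
              double err = n / nominal - n / actual;  /* seconds of drift */
              printf("after %ld samples: ~%.2f s of timing error\n", n, err);
              return 0;
          }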

      I also think variable length records are a really bad idea. I prefer fixed length records on power-of-two boundaries for a variety of reasons. Mostly, they permit more rapid access to the data without having to build extensive indices for each data block.
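
      For example (record size is illustrative, not a proposal), with fixed
      512-byte records the k-th record is reachable by arithmetic alone:

          /* Sketch: direct seek to record k in a file of fixed-size
             records; no per-record index is needed. */
          #include <stdio.h>

          int main(void) {
              const long reclen = 512;             /* assumed record size */
              const long k      = 1000;            /* record to fetch */
              FILE *f = fopen("data.mseed", "rb"); /* hypothetical file */
              if (!f) return 1;
              fseek(f, k * reclen, SEEK_SET);      /* offset = k * reclen */
              /* ... read and decode one record here ... */
              fclose(f);
              return 0;
          }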

      Dave



      • One alternative, which would be better suited for real-time, would be
        using fixed-size "frames" instead of records. Think of a record
        consisting of a header frame followed by a variable number of data frames.
        A frame might include timecode (sequence no.), channel index (for
        multiplexing) and possibly CRC. Due to fixed size, finding the start of
        a frame would be unambiguous. Compared to a 512-byte mseed 2.x record
        (header + 7 data frames), latency would be 7 times smaller, because each
        data frame could be sent separately. And by using more data frames one
        could reduce overall bandwidth without increasing latency.

        Transmitting data in 64-byte chunks was already attempted with mseed
        2.4, but unfortunately the total number of samples and the last sample
        value must be sent before any data. In the new format I would put such
        values, if needed, into a "summary" frame that would be sent after data
        frames.
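
        Purely to illustrate (the field names and sizes below are my guesses,
        not a worked-out proposal), such a fixed-size frame might look like:

            /* Hypothetical 64-byte frame: sequence number for ordering,
               channel index for multiplexing, a CRC, fixed payload. */
            #include <stdint.h>

            #define FRAME_PAYLOAD 52        /* 64 total - 12 header bytes */

            struct frame {
                uint32_t sequence;          /* timecode / sequence number */
                uint16_t channel_index;     /* for multiplexed streams */
                uint16_t payload_used;      /* valid bytes in payload */
                uint32_t crc;               /* integrity check */
                uint8_t  payload[FRAME_PAYLOAD];
            };  /* fixed size keeps frame boundaries unambiguous */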

        Regards,
        Andres.


        • I like this idea. I've been considering similar concepts, dubbed microSEED, with frames that are not necessarily fixed length. The idea was left out of the straw man because it's a pretty radical change from current miniSEED, where each record is independently usable. Lots of existing software would require significant redesign to read such data. But if this concept could be developed in such a way that multiple frames could be easily reassembled into a next generation miniSEED record, it might be a nice way to satisfy both archiving and real-time transmission needs.

          Chad



          • All existing software would require significant modifications even with
            the current straw man (especially if variable length records are
            allowed). SeedLink, Web Services, all user software. The overall cost of
            the transition would be huge.

            If we want to design a format for the next 30 years, we should not
            restrict ourselves with limitations imposed by the current miniSEED
            format. On the other hand, if compatibility with the current miniSEED
            format is desired, just add another blockette to miniSEED 2.x (as
            suggested by Angelo Strollo earlier) and that's it.

            Back to the idea of "frames" -- indeed, some info that is needed for
            real-time transfer could be stripped in an offline format. If records could
            be easily converted to frames and vice versa, it would be great.
            Currently the main problem is forward references (number of samples,
            detection flags, anything that refers to data that is not yet known when
            sending the header), so we need a "footer" in addition to the header.

            Regards,
            Andres.


            • On Aug 19, 2016, at 5:55 AM, andres<at>gfz-potsdam.de wrote:

              All existing software would require significant modifications even with
              the current straw man (especially if variable length records are
              allowed). SeedLink, Web Services, all user software. The overall cost of
              the transition would be huge.

              If we want to design a format for the next 30 years, we should not
              restrict ourselves with limitations imposed by the current miniSEED
              format. On the other hand, if compatibility with the current miniSEED
              format is desired, just add another blockette to miniSEED 2.x (as
              suggested by Angelo Strollo earlier) and that's it.

              Hi Andres,

              You have a point that we should not be limiting our thinking. I do think there is a sweet spot in the balance between a small patch on the current miniSEED (in particular, one that could be very detrimental to data identification) and something radically different. The very first straw man was created with that particular balance in mind as a place to start discussion, with the full expectation that it would evolve. My feeling is that non-independent records, a la headers plus frames transmitted independently, is a more radical change than anything in the straw man from the perspective of code reading the data.

              As for the concept of "just" adding a blockette to extend the network code: all of the software you mentioned (SeedLink, Web Services, all user software), in addition to data center schemas, data center software and, very importantly, data generation systems, would need to be updated in order to not lose network identifiers. The libraries that do this parsing at the data center and user levels are the easy part; pushing updates out to all the places that use them will simply take a lot of time. As D. Ketchum wrote, updates will not be overnight. You can easily imagine there will be old versions of slink2ew, chain_plugin and many, many more pieces of middleware running for a very long time. In some cases they will be transforming the data from miniSEED to something else and silently stripping the network identifiers out. In other cases the new blockette(s) may be retained, but all miniSEED3 data will need to be referred to as network "99" (or whatever) because the old system doesn't know any better. The overall cost of this transition would be huge, even for just adding a blockette.

              Surely we can address other fundamental issues, such as record byte order identification, which cannot be fixed with a simple blockette, if we are going to effectively go through a full software stack update. Much of the planning, such as getting systems and software updated well before any new-style data flows, would be the same.

              Back to the idea of "frames" -- indeed, some info that is needed for
              real-time transfer could be stripped in offline format. If records could
              be easily converted to frames and vice versa, it would be great.
              Currently the main problem is forward references (number of samples,
              detection flags, anything that refers to data that is not yet known when
              sending the header), so we need a "footer" in addition to the header.

              A footer would work. Alternatively, the "micro" header on each frame could contain: the start time of the primary header (for sequencing), the start time of the first sample in the frame, the number of samples in the frame and any optional headers relevant for the frame (detection). Reassembly to a full record would require summing up the sample counts, combining the optional headers and stripping the micro/frame headers. Some care would be needed with the details. If we created such a telemetry framing for otherwise complete "next generation" miniSEED, it would have the advantage of limiting the telemetry complexity to those systems that need it, allowing some degree of separation between the use cases of telemetry, archiving, etc. It's certainly an intriguing line of thought.
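
              As a sketch of that reassembly (all names and types below are hypothetical,
              just to make the steps concrete): keep the first frame's start time, sum the
              per-frame sample counts, and concatenate the payloads:

                  /* Hypothetical frame reassembly following the outline
                     above; not an actual format definition. */
                  #include <stdint.h>
                  #include <string.h>

                  struct micro_frame {
                      int64_t  starttime;       /* time of first sample in frame */
                      uint32_t sample_count;    /* samples in this frame */
                      uint32_t payload_len;     /* bytes in payload */
                      const uint8_t *payload;   /* encoded samples */
                  };

                  /* Concatenate payloads into buf; return total sample count. */
                  uint32_t reassemble(const struct micro_frame *fr, int n,
                                      uint8_t *buf, uint32_t buflen,
                                      int64_t *rec_start) {
                      uint32_t total = 0, used = 0;
                      *rec_start = n > 0 ? fr[0].starttime : 0;
                      for (int i = 0; i < n; i++) {
                          if (used + fr[i].payload_len > buflen)
                              break;            /* record is full */
                          memcpy(buf + used, fr[i].payload, fr[i].payload_len);
                          used  += fr[i].payload_len;
                          total += fr[i].sample_count;
                      }
                      return total;
                  }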

              regards,
              Chad


            • Chad Trabant wrote on 19.08.2016 at 08:58:
            The idea was left out of the straw man because it's a pretty radical change from current miniSEED where each record is independently usable. Lots of existing software would require significant redesign to read such data.

            Thank you, Chad, for addressing an important point: the costs of the new
            format!

            Do you have a rough idea about what the costs of the transition to an
            incompatible new data format would be? Reading this discussion one might
            get the impression that the transition would be a piece of cake. A
            version change, a few modified headers, an extended network code plus a
            few other improvements like microsecond time resolution. Hitherto
            stubborn network operators will be forced not to use empty location
            codes. But all these benefits will come with a price tag because of the
            incompatibility of the new format with MiniSEED.

            So what will be the cost of the transition? Who will pay the bill? Will
            the costs be spread across the community or will the data centers have
            to cover the costs alone?

            There are quite a few tasks ahead of "us". "Us" means a whole community
            of data providers, data management centers, data users, software
            developers, hardware manufacturers. World-wide! I.e., everyone who is
            now working with MiniSEED and has got used to it. Everyone!

            Tasks will include:

            * Recoding of entire data archives

            * Software updates. In some cases redesign will be necessary, while
            legacy software will just cease to work with the new format.

            * Migrate data streaming and exchange between institutions world-wide.
            It is easy to foresee that real-time data exchange, which was pretty
            hard to establish in the first place with many partners world-wide, will
            be heavily affected by migrating to the new format.

            * Request tools: will there be a deadline like "by August 1st, 2017,
            00:00:00 UTC, all fdsnws's have to support the new format"? Or will
            there be a transition period? If so, how will this be organized? Either
            access to two archives (one for each format) will be required, or the
            fdsnws's will have to be able to deliver both formats by conversion on
            the fly.

            * Hardware manufacturers will have to support the new format.

            * Station network operators will have to bear the costs of adopting the
            new format even though it may not yield any benefit to them.

            I could probably add more items to this list but thinking of the above
            tasks causes me enough headaches already. That's the reason why I am
            publicly raising the cost question now because the proponents of the new
            format must have been thinking about this and probably have some idea
            about how costly the transition would be.

            Speaking of costs I would like to remind you of the alternative proposal
            presented on July 8th by Angelo Strollo on behalf of the major European
            data centers. They propose to simply introduce a new blockette 1002 to
            accommodate longer network codes but with enough space for additional
            attributes such as extended location id's etc. This light-weight
            solution is backward compatible with the existing MiniSEED. It is
            therefore the least disruptive solution and minimizes the costs of the
            transition.

            Regards
            Joachim

              • Just want to point out that a new blockette with an extended network code
                is NOT backwards compatible. Old software that does not recognize the
                new blockette (and therefore likely ignores it) will report that it
                successfully read the data, but will attribute new data records to the
                wrong network. It may appear that this is the lower-cost option, however
                this would generate a new class of bugs that would likely be subtle and
                would persist for decades to come. There is pain both ways, but I
                would much prefer a system that fails obviously when it fails to one
                that seems to work but is actually wrong, infrequently and in a way
                that is hard to notice.

              A failure that looks like a failure gets fixed quickly, a failure that
              looks like a success can easily persist for a long time, causing much
              more damage in the long run.

              Philip


              • A special 2-letter network code can be reserved. AFAIK there are even
                some obvious network codes, such as "99" or "XX" that have never been
                used. If data records are attributed to network "99", it is quite
                obvious what is going on. Yet, if I use my old PQLX to quickly look at
                the data, I don't care about the network code.

                Wasn't the network code added in SEED 2.3 in the first place? Any issues
                known?

                Regards,
                Andres.


                • I agree with Philip: the proposed network extension blockette has a fundamental problem regarding backwards compatibility. It is only backwards compatible in that the data can still be read, but critical information will be quietly lost until a large number of legacy readers are replaced (which will take a very long time). Until then, when using legacy readers, all of the functions of a network code (ownership identification, logical station grouping) are lost, with many implications. You can easily imagine older data converters being used for a long time and the expanded network code going missing right away. I predict it wouldn't take very long before network 99 shows up in publications.

                  I do not believe assertions that all users of SEED will think it obvious what is going on with network 99. The grad student doing their work with an old version of PQLX is simply not going to know.

                  As Philip says, it'd be better to break things than quietly continue to work while losing network identifiers.

                  Furthermore, even this small update would require modifications to all software chains, from data generation to data centers to users, along with database schemas, protocols, etc., etc. That is a huge amount of work for such a small change. If we are going to go through all of that we should at least fix some of the other issues with miniSEED. And now we are back at the beginning of this conversation that started in ~2013.

                  Chad


                  • Chad Trabant wrote on 20.08.2016 at 02:01:
                    I agree with Philip: the proposed network extension blockette has a fundamental problem regarding backwards compatibility. It is only backwards compatible in that the data can still be read, but critical information will be quietly lost until a large number of legacy readers are replaced (which will take a very long time). Until then, when using legacy readers, all of the functions of a network code (ownership identification, logical station grouping) are lost, with many implications.

                    Hello Chad,

                    what would be "a very long time"?

                    First of all note that most of the current infrastructures world-wide
                    will not be affected by the blockette-1002 extension at all. The reason
                    for this is that most institutions will simply not produce any data with
                    1002 blockettes because they don't need the extended attributes. They
                    will continue to produce and exchange 2.4 MiniSEED just as they have
                    been for many years. They will not have to upgrade their station
                    hardware/software in order to produce up-to-date, valid MiniSEED. NO CHANGE!

                    Of course, "most institutions" is not necessarily all and sooner or
                    later data with blockette 1002 will start to circulate. This will
                    require blockette-1002 aware decoders to make use of the extended
                    attributes.

                    The obvious question is now: How much time would it take to update
                    libmseed, qlib, seedlink et al. to support blockette 1002? A week? A
                    month? A year? A very long time?

                    As soon as blockette-1002 aware versions of said libraries are
                    available, the software using them needs to be re-compiled and linked
                    against them. A lot of software, if not most, is going to be
                    blockette-1002 enabled that way, without the need for further modifications.
                    And, very importantly, the software can be made blockette-1002-ready
                    WELL IN ADVANCE of the actual circulation of blockette-1002 data!

                    This means specifically: If a consensus about the blockette 1002
                    structure can be found, say, by December (e.g. AGU), then the work to
                    make libmseed, qlib, seedlink et al. blockette-1002 ready and
                    subsequently the software that uses them will take at most a few more
                    months. With an updated libmseed, software like ObsPy and SeisComP will
                    support at least the extended attributes out of the box. I haven't
                    looked at the PQLX details but since it also uses libmseed to read
                    MiniSEED, a blockette-1002-ready libmseed should allow the transition
                    with very little (if any) further effort. I am therefore sure that most
                    relevant, actively maintained software can likewise be made
                    blockette-1002 ready before the Kobe meeting.

                    There are, of course, details that need to be addressed. For instance,
                    the proposed 4-character location identifier and how it is converted to
                    Earthworm's tracebuf format, as pointed out by Dave. But these problems
                    would be the same for blockette-1002 MiniSEED and the proposed new format.

                    You can easily imagine older data converters being used for a long time and the expanded network code going missing right away.

                    Older data converters WILL continue to work fine with all currently
                    existing MiniSEED streams. Whereas NO older data converters will work
                    with ANY data converted to the proposed new and entirely incompatible
                    format!

                    I predict it wouldn't take very long before network 99 shows up in publications.

                    This implies authors who don't have a clue about what a network code is.
                    How would they be able to correctly use a network code? That's not an
                    issue of data formats but of channel naming in general.

                    I do not believe assertions that all users of SEED will think it obvious what is going on with network 99. The grad student doing their work with an old version of PQLX is simply not going to know.

                    Why not inform the grad student? What does it take for the grad student
                    to learn that in an FDSN network code context "IU" doesn't stand for
                    "Indiana University"?

                    http://www.fdsn.org/networks/detail/IU

                    That's all! In case that grad student happens to stumble upon "99" then
                    probably an explanation on http://www.fdsn.org/networks/detail/99 would
                    help him or her.

                    As Philip says, it'd be better to break things than quietly continue to work while losing network identifiers.

                    What do you mean by "things"? The proposed new format and its
                    implementation would not just break the grad student's PQLX but it would
                    break ENTIRE INFRASTRUCTURES. World-wide and from bottom to top!

                    Do you want to disrupt the entire FDSN data exchange to protect the grad
                    student using an old PQLX from getting a "99" network code? Is that what
                    you are saying?

                    Furthermore, even this small update would require modifications to all software chains,

                    You have a position and are trying your best to defend it. This is
                    legitimate of course. But you are exaggerating minor problems in order to
                    discredit an approach that you cannot deny would be a lot less
                    disruptive and expensive than the proposed new format.

                    from data generation

                    No modifications are needed at the stations. Stations continue to
                    produce 2.4 MiniSEED, which remains valid. There is no need to
                    produce blockette 1002 except for stations that, e.g., have extended
                    network or location codes. There will not be many (if any) in currently
                    existing networks.

                    to data centers

                    Data centers are the ones that benefit most from the continuity that the
                    blockette-1002 approach would allow because they neither need to recode
                    entire archives nor have to provide "old" and "new" data formats in
                    parallel.

                    to users

                    Only users that actually use blockette-1002 data. If these users use
                    up-to-date versions of actively maintained software such as ObsPy,
                    SeisComP or MiniSEED-to-SAC converters they will not notice any
                    difference. Legacy software will continue to work, with the exception of
                    the network code, which will show up as "99".

                    along with database schemas, protocols, etc., etc.

                    There are some cases where updates will require further efforts. We
                    already read about Earthworm and the limited space for the location
                    identifier in the current Tracebuf2 format. But the effort at the
                    Earthworm end to accommodate a longer location identifier would be the
                    same for blockette-1002 data as for the proposed new format. It is
                    therefore understandable that the Earthworm community has reservations
                    against an extended location code because it would have to pay the price
                    for something it probably doesn't need.

                    In general, chances are high that most database schemas will remain
                    unaffected, as will most protocols.

                    But I am curious to hear about specific database schemas that would be
                    more difficult to update to blockette-1002 MiniSEED than to the proposed
                    new format.

                    That is a huge amount of work for such a small change.

                    I hope to have pointed out by now that the work required to implement
                    blockette 1002 would in fact be dramatically less than the work
                    required to upgrade entire infrastructures (indeed from the data loggers
                    all the way to data users) to a fully incompatible new format.

                    And now we are back at the beginning of this conversation that started in ~2013.

                    What conversation are you referring to?

                    Cheers
                    Joachim

                    • Hi

                      Just like to point out that merely upgrading a library, like libmseed,
                      to parse a new blockette does not suddenly make older software
                      compatible with a longer network code. If the software itself is not
                      also upgraded to use the information in the new blockette, then the new
                      information is effectively ignored. I feel that this idea that there
                      is a non-disruptive, easy "fix" to expanding the network code is
                      unrealistic.

                      Philip


                      • Philip Crotwell wrote on 24.08.2016 at 15:55:
                        Just like to point out that merely upgrading a library, like libmseed,
                        to parse a new blockette does not suddenly make older software
                        compatible with a longer network code.

                        The structure in libmseed that holds the record header attributes is 'MSRecord'. If the decoder of an updated libmseed sees a blockette 1002, it will have to take the information about the network code etc. from there and populate the MSRecord accordingly. That's all. The software will then use or copy the content of MSRecord.network, which by the way is already large enough (10 characters plus '\0') to accommodate the extended network code.

                        Mission accomplished! Well... mostly.

                        If the software itself is not
                        also upgraded to use the information in the new blockette then the new
                        information is effectively ignored.

                        There will of course be target data structures in which the network code is hard-coded to be only two characters long. In such cases (hopefully) only two characters are copied. I haven't found any software in which this would be an actual issue. There *is* a similar issue, though, with the extended location code and the Earthworm Tracebuf2 structure. This will be a pain to solve within the Earthworm community, but neither blockette 1002 nor the proposed new format can be blamed for it. It's a limitation of Earthworm that is due to the current SEED channel naming conventions.
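
                        For illustration, such a defensive copy into a hard-coded two-character target might look like the sketch below; the struct is only a stand-in with the same field width as such legacy headers, not the actual Earthworm definition:

                        #include <string.h>

                        typedef struct {          /* stand-in only, not a real Earthworm header */
                            char net[3];          /* two characters plus terminating '\0' */
                        } LegacyHeader;

                        /* At most two characters survive: an extended code such as "ABCD"
                           is silently truncated to "AB", which is exactly the hazard being
                           discussed. */
                        void set_legacy_network (LegacyHeader *hdr, const char *network)
                        {
                            strncpy (hdr->net, network, sizeof (hdr->net) - 1);
                            hdr->net[sizeof (hdr->net) - 1] = '\0';
                        }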

                        ObsPy, SeisComP and SAC, to name a few, would have no problem at all accommodating the extended attributes. This is probably true for most other actively maintained software that uses either libmseed or qlib.

                        I feel that this idea that there
                        is a non-disruptive, easy "fix" to expanding the network code is
                        unrealistic.

                        There will never be a solution involving zero effort.

                        The question is how much effort each of the proposals would require. The blockette-1002 solution would be by far the easiest to adopt. But most importantly, existing infrastructures not requiring extended headers will not be disrupted at all. In other words: all existing real-time data exchange world-wide can continue to work as it does now. This allows enough time to upgrade software to support blockette 1002, and once blockette-1002 data actually start to circulate, most software infrastructures should be able to handle them properly.

                        Cheers
                        Joachim

                • Hello,

                  I've read most of the topics and I'm inclined to follow Joachim's and
                  Andres's comments about the change cost, especially regarding the
                  network operators and the people who use data operationally (as David
                  also pointed out).
                  I'm not worried about PhD students and those who work on off-line data.

                  And I understand the concerns about a reserved network code like "99"
                  being incorrectly used in publications because legacy software does not
                  read the extended one.

                  So, all this preamble to simply ask the following question.
                  Will the extended network code be reserved for temporary networks?
                  Or will it also be available for new permanent networks as soon as it
                  is adopted?

                  I ask this because if it's only for temporary networks, then we have
                  more time to migrate all the operational software.
                  If, on the contrary, it's also available for permanent networks, we
                  would very soon see new permanent stations not being used by most
                  operational entities (I'm thinking right now of Tsunami Service
                  Providers and global location and/or CMT providers) because their
                  software doesn't support network codes longer than two letters.
                  Consider that some Earthworm modules were made location-code compatible
                  only one or two years ago, and some of their users still haven't
                  migrated to those new modules!

                  Regards.

                  Jean-Marie SAUREL.

                  On 19.08.2016 15:15, andres<at>gfz-potsdam.de wrote:
                  On 08/19/2016 04:37 PM, Philip Crotwell wrote:
                  Just want to point out that a new blockette with extended network code
                  is NOT backwards compatible. Old software that does not recognize the
                  new blockette (and therefore likely ignores it) will report that it
                  successfully read the data, but will attribute new data records to the
                  wrong network. It may appear that this is a lower cost, however this
                  would generate a new class of bugs that would likely be subtle and
                  would persist for decades to come. There is pain in both ways, but I
                  would much prefer a system that fails obviously when it fails to one
                  that seems to work but actually is wrong infrequently and in a way
                  that is hard to notice.

                  A failure that looks like a failure gets fixed quickly; a failure that
                  looks like a success can easily persist for a long time, causing much
                  more damage in the long run.

                  A special 2-letter network code can be reserved. AFAIK there are even
                  some obvious network codes, such as "99" or "XX", that have never been
                  used. If data records are attributed to network "99", it is quite
                  obvious what is going on. Yet, if I use my old PQLX to quickly look at
                  the data, I don't care about the network code.

                  Wasn't the network code added in SEED 2.3 in the first place? Any
                  issues known?

                  Regards,
                  Andres.


                  --
                  --------------------------------------
                  ICG-CARIBE EWS WG1 chair
                  Institut de Physique du Globe de Paris
                  Observatoire Volcanologique et Sismologique
                  1 rue Jussieu
                  75005 Paris

              • Philip Crotwell wrote on 19.08.2016 16:37:
                Just want to point out that a new blockette with extended network code
                is NOT backwards compatible.

                As I wrote before, it *is* backward compatible with the *existing* MiniSEED, which is *all* MiniSEED currently existing in *all* archives. I didn't write "blockette-1002 MiniSEED", because it is obvious that attributes specific to blockette 1002 need to be retrieved from there.

                The *only* compromise w.r.t. backward compatibility occurs if blockette-1002-unaware software reads blockette-1002 MiniSEED. That is the price tag of the alternative solution: a minimal cost compared to, e.g., recoding entire data archives and disrupting complex data infrastructures. And as soon as that previously unaware software is linked against an updated libmseed or qlib, the problem is gone anyway. In fact, for many data centers and infrastructures, the cost will be close to zero in practice.

                Actually an updated libmseed or qlib would be made available long before the first blockette-1002 MiniSEED data actually start circulating publicly. Therefore all actively maintained software can be made 1002-ready well in advance.

                Regards
                Joachim


            • Hi Joachim,

              At the IRIS DMC we have thought quite a bit about the costs of a transition to a newer generation of miniSEED. In many respects I think the DMC has more at stake in terms of operational change than any other single group in the FDSN. This is a discussion intended to develop a proposal for the FDSN to consider in 2017, and only after that can an adoption plan be finalized. Personally, depending on the transition discussions, I would be surprised if we have much traction on adoption by 2018; it could easily take longer.

              The transition of SEED data to the new identifiers outlined in the alternative proposal presented on July 8th by Angelo Strollo would also require most of the same data systems (data producers, middleware, data centers, user software) to be updated, which would take a long time. Also, until most software has been updated, we risk losing extended network identifiers. The implication that we could simply add a new blockette, update a few libraries, and the transition would be over seems very unrealistic to me. Furthermore, that is a lot of cost to address a single issue in SEED.

              Chad

              Chad Trabant wrote on 19.08.2016 at 08:58:
              The idea was left out of the straw man because it's a pretty radical change from current miniSEED where each record is independently usable. Lots of existing software would require significant redesign to read such data.

              Thank you, Chad, for addressing an important point: the costs of the new
              format!

              Do you have a rough idea about what the costs of the transition to an
              incompatible new data format would be? Reading this discussion one might
              get the impression that the transition would be a piece of cake. A
              version change, a few modified headers, an extended network code plus a
              few other improvements like microsecond time resolution. Hitherto
              stubborn network operators will be forced not to use empty location
              codes. But all these benefits will come with a price tag because of the
              incompatibility of the new format with MiniSEED.

              So what will be the cost of the transition? Who will pay the bill? Will
              the costs be spread across the community or will the data centers have
              to cover the costs alone?

              There are quite a few tasks ahead of "us". "Us" means a whole community
              of data providers, data management centers, data users, software
              developers, hardware manufacturers. World-wide! I.e., everyone who is
              now working with MiniSEED and has got used to it. Everyone!

              Tasks will include:

              * Recoding of entire data archives

              * Software updates. In some cases redesign will be necessary, while
              legacy software will just cease to work with the new format.

              * Migrate data streaming and exchange between institutions world-wide.
              It is easy to foresee that real-time data exchange, which was pretty
              hard to establish in the first place with many partners world-wide, will
              be heavily affected by migrating to the new format.

              * Request tools: will there be a deadline like "by August 1st, 2017,
              00:00:00 UTC" by which all fdsnws's have to support the new format? Or
              will there be a transition period? If so, how will it be organized?
              Either access to two archives (one per format) will be required, or the
              fdsnws's will have to be able to deliver both formats by converting on
              the fly.

              * Hardware manufacturers will have to support the new format.

              * Station network operators will have to bear the costs of adopting the
              new format even though it may not yield any benefit to them.

              I could probably add more items to this list, but thinking of the above
              tasks causes me enough headaches already. That is why I am publicly
              raising the cost question now: the proponents of the new format must
              have thought about this and probably have some idea of how costly the
              transition would be.

              Speaking of costs, I would like to remind you of the alternative
              proposal presented on July 8th by Angelo Strollo on behalf of the major
              European data centers. It proposes simply introducing a new blockette
              1002 to accommodate longer network codes, with enough space for
              additional attributes such as extended location identifiers. This
              light-weight solution is backward compatible with the existing
              MiniSEED. It is therefore the least disruptive solution and minimizes
              the costs of the transition.

              Regards
              Joachim



      • Hi Dave,

        I also think variable length records is a really bad idea. I prefer fixed length records on power of two boundaries for a variety of reasons. Mostly it permits more rapid accessing of the data without having to build extensive indices for each data block.


        Can you share some of the other reasons?

        I think I get the rapid-access reasoning. As I've heard it described, one makes some educated guesses about where the data are in a file and skips around until zeroing in on the correct record(s).
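
        For what it's worth, here is a rough sketch of that kind of lookup over fixed-length records. The toy header layout (start time stored as a leading double) is purely an assumption to keep the seek arithmetic concrete; real miniSEED headers of course differ:

        #include <stdio.h>

        /* Toy assumption: each record begins with its start time stored as a
           double. The payoff of fixed-length records is the seek itself: the
           offset of record `recnum` is simply recnum * reclen. */
        static double record_start_time (FILE *fp, long recnum, long reclen)
        {
            double t = 0.0;
            fseek (fp, recnum * reclen, SEEK_SET);
            if (fread (&t, sizeof t, 1, fp) != 1)
                t = 0.0;
            return t;
        }

        /* Return the last record whose start time is <= target, assuming
           records in the file are sorted by time. */
        long find_record (FILE *fp, long nrecs, long reclen, double target)
        {
            long lo = 0, hi = nrecs - 1;
            while (lo < hi)
            {
                long mid = lo + (hi - lo + 1) / 2;
                if (record_start_time (fp, mid, reclen) <= target)
                    lo = mid;
                else
                    hi = mid - 1;
            }
            return lo;
        }

        With variable-length records the recnum * reclen arithmetic disappears and some form of index becomes necessary.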

        The notion of a variable record length has been raised a number of times in the past; we finally added it to the straw man for these reasons:
        a) In many ways it is a better fit for real-time streams. There is no more waiting to "fill a record" or transmitting unfilled records; latency is much more controllable without waste. Also, data are usually generated at a regular rate, but if one would like to package and transmit them at a regular rate with compression, the output size is not readily predictable.

        b) Adjustments to records, such as adding optional headers, become much easier. In 2.x miniSEED, if you want to add a blockette, for example, but there is not enough room, you are stuck with either re-encoding the data into unfilled records or reprocessing a lot of data to pack it efficiently.

        I'm on the fence with this one and would appreciate hearing about any other pros and cons regarding variable versus fixed record lengths.

        thanks,
        Chad




        • Chad,

          For instance, the Edge/CWB software takes advantage of fixed-length records, or at least power-of-two lengths, to store all channels of miniSEED in one file, where regions of the file are reserved for each channel (generally extents of 64 512-byte blocks). It can accommodate 2.4 miniSEED of any size, but the blocks fit nicely into their extents, and the index used to find a channel and time range only has to cover the extents. This greatly speeds up access to queried data. I know a lot of installations use "file per channel per day", but we found that pretty inefficient. I know that many use the binary search method you mentioned, which also works better on fixed-length blocks.
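
          As a minimal sketch of the extent arithmetic this implies (the 64-block, 512-byte figures come from the description above; the function name and layout are mine):

          #define BLOCK_SIZE     512L   /* bytes per miniSEED record */
          #define EXTENT_BLOCKS   64L   /* blocks reserved per channel extent */
          #define EXTENT_SIZE    (BLOCK_SIZE * EXTENT_BLOCKS)   /* 32 KiB */

          /* Byte offset of block `blk` within extent number `extnum` of the
             big multi-channel file. The index only needs one entry per 32 KiB
             extent, not one per record. */
          long block_offset (long extnum, long blk)
          {
              return extnum * EXTENT_SIZE + blk * BLOCK_SIZE;
          }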

          I do not particularly think miniSEED is a very good choice for telemetry when short latency is desired, like for earthquake early warning. The fixed part of the header is so big relative to the payload that it is not bandwidth efficient. If variable-length records are desired for this, I think the alternative of using another telemetry format that is more efficient should win out. Note that the current Q330 one-second packets are not in miniSEED form, but they are fairly efficient and variable length. The receiving software takes this format and generates miniSEED. The one-second packets are available for EEW, and the miniSEED is generated for later use and archival. My take is that miniSEED 3 should not try to be a telemetry format, as it would be a bad one; it is a standard format used at data centers and after the real-time processing is done. Further, the telemetry format is a competitive function best left to the digitizer vendors. We should insist that their telemetry formats produce good miniSEED 3, including all of the mandatory and optional flags etc., but how they achieve that should be left to them.

          Dave


