2022 Meeting Notes#

2022-06#

  • Facilitator: Yuvi

  • Notetaker: Yuvi

  • Timekeeper: Chris

Onboard James!#

  • Issue: https://github.com/2i2c-org/meta/issues/337

    • July 1 as first day!

    • Trying to pay attention to issues, slack, email & join in wherever.

    • Wears more than 1 hat, taking off hats as appropriate. Intent is to be generally available for most 2i2c things

  • Interested in ‘making research & education delightful’

    • Documentation, training, onboarding!

    • Nature of research & edu is that we keep having new cohorts of people who need to be onboarded!

    • Participating in project Pythia hackathon working on documentation (xarray, dask, etc)

  • Figuring out and joining different communities!

    • pangeo is a big one (one of the ones that 2i2c started to operationalize!)

    • Finding names of other communities we’re a part of! Finding their names, connecting on their communities, etc.

    • Registered for Scientific community management program (along with Sarah!)

    • Want to enable other community managers too! So they feel empowered to do the things they need to do

  • Product side!

    • Right to Replicate is core value of 2i2c

    • Some experience in running helm, k8s, z2jh, etc.

    • Has been a while, but am going to go through 2i2c’s docs and see if it ‘works’!

    • Fill in holes as it’s needed, get a feel for product as it is

    • Run it on something like minikube to test it out

  • Figure out what a 3-6 month strategy for division of product & community would look like

    • Already collaborating with Jim on the partnership & sales side of it

  • Where are the sources of truth? Other ones James hasn’t explored?

    • Some on google drive

      • Figuring out how to use team shared drives

      • Google offers it for free for nonprofits, but because we’re under CSS they might already have it

      • Chris is going to figure out a shared Google Drive setup.

Language and naming around our managed hub service#

  • Issue: https://github.com/2i2c-org/infrastructure/issues/1474

  • New diagrams of our service model from a recent CZI grant: https://docs.google.com/presentation/d/12Xkpin0sNwcBygArFUSalkrgfONgaRrfUYokyfIWWh8/edit

  • What do we call the process by which we provide the service, and how do we communicate that to communities?

    • How do you bring in the concept of that shared responsibility so we don’t pigeonhole ourselves as a SaaS?

  • ‘Managed’ implies 2i2c does ‘everything’ which is not true - we run it collaboratively

    • As a user, being able to have control over the user experience without requiring that I leave.

    • Open source communities often run into these issues

    • Don’t want to create two “classes” of people

    • Need to create community processes to help people do this

  • Right to Participate is important

  • Words to describe this

    • Managed JupyterHub service was picked because cloud service was managed by us, so others won’t have to worry about it

      • We call it JupyterHub to give people the idea that it is an open source component, just using OSS tools

    • One proposal: “Collaborative Hub”

      • But feels like this refers to the artifact (e.g. “collaborative editing”) not the process.

    • Make up a new word? (a neologism)

      • Collaborative hub model can be part of the bullet points, can be a brand new thing.

      • Are we providing JupyterHub, and our product is JupyterHub? Or is it JupyterHub + other stuff? Not sure what to call it.

Constraints / guiding principles to define this service#

  • Key words / ideas

    • Unburdening - “We do work so you don’t have to”

    • Participatory / Collaborative - “We invite your participation at several levels if you wish”

    • Transparency - “All our work is done in the open”

    • Reciprocity - “We do most of our work via upstream improvements”

    • “Levels of Participation” and “Shared Responsibility” are both aspects of the service. Describe how we have a range of collaboration options.

    • Putting many concepts into a single name is probably not possible, so we should choose a name that is not misleading and explain the details with extra documentation.

    • Contribute to a pre-existing large multi-stakeholder community

  • Collaborative, participatory service

  • Erik: Transparent operations

  • Sarah: Unburdening (traditionally, this is ‘put in a postdoc who has no k8s knowledge’)

  • Chris: Participatory is an important one, the ways in which the service is trying to take inspiration from healthy inclusive OSS communities.

    • There is a notion of OSS as a license: you can see all the code and you can fork it! But the dynamics of production are a single team of people (docutils) who don’t care about increasing participation.

    • But participatory is something like JupyterHub itself, where there are pathways to getting in and increasing the set of people who are participating.

    • Tension between participatory and unburdening, as we’re asking for labor from other organizations.

  • Sarah: “Levels of participation”, which we do have.

    • We are running a managed kubernetes service & a collaborative hub service.

    • We don’t get specific k8s feature requests because they don’t know what they want but they do ask for specific hub features!

    • It’s our job to figure out if it’s a k8s feature or not.

    • Extra complicated to put all this in a single term as it’s a spectrum of variation of input as we go down our big stack

  • Damian: Trying to have all these complex concepts working at different levels in one name is impossible.

    • So maybe we just have a name for the thing we are offering. And a description of what we are offering, otherwise it’ll be a bit difficult to come up with a name.

    • We are providing JupyterHub + other things: we are offering an opinionated way to deploy hubs on the cloud. This opinionated way is the product itself.

  • Chris: Signal boost: communities compare services by end product (features, end product I get, etc.), and 2i2c needs to sell the idea that the process by which you get it is as important as the product itself. Not intuitive for a lot of communities that are used to the producer / consumer pigeonhole that tech tends to create. Process is important!

  • Damian: Right to participate: evolving it and getting it written down will help us fight the battles against University IT, etc. Will help us get into that conversation at the level where we don’t end up in the situation wrt Columbia credits again.

  • Yuvi: “Upstream” and “Upstream contributions” is another important part of this. It is a differentiator between what we do and many other startups. We contribute to a pre-existing large multi-stakeholder community.

Discuss making our operation and dev item assignments two people by default#

  • https://github.com/2i2c-org/team-compass/issues/444

  • How do we manage inefficiencies in pairing?

  • e.g., time zones and bottlenecks

  • One ‘lead’ developer who can work on the project and a ‘companion’ person who can help iterate quickly

  • We did this with Sarah & Erik for the CI/CD revamp! Worked really great, showcased the pattern is pretty good.

  • Trying to make this more official

  • Damian is trying to make requests for new hubs to use pairs of people, as these are ‘must do’

  • Mismatch of tz makes things a bit inefficient.

  • Should come up with teams that might change composition, that collaborate together. But how to deal with tz?

  • We should adopt pairing in almost all our projects!

  • Chris: What are the 2-3 main problems we’re trying to resolve with pairing?

    • People start working on projects because they are assigned to work on those, but they are not getting feedback as soon as it would be helpful. Lots of conversations / PRs stay open for days and days, but not because people don’t like to provide feedback. Need to have interactive discussion to push things forward. Pairing tries to force this.

    • We have a lot of complexity in our stack! Pairing makes it easier to push forward when things are complicated.

    • Prevent siloing of our knowledge about the idiosyncrasies of the infrastructure and of skills in general.

Action items#

  • Need to get credentials for James

    • Chris is championing the onboarding, Sarah has offered to help with the technical bits

2022-05#

  • Facilitator: Sarah

  • Notetaker: Erik, Jim

Job hire update/decision making/discussion#

  • Proposal document is here

    • Brief description of current candidates

    • Describe suggestion to make an offer

    • Question: Any objections to making the offer?

Make an offer:

  • CH: meeting goal: to get a feeling one way or another to provide a job offer for a candidate.

  • Important that we document the process we used.

  • DA: We have consensus on hiring top candidate? +1s all around…

Additional option:

Next steps (CH)

  1. offer position to top candidate. CH will send email soon!

  2. do a budget analysis to look into other hires

Matlab as a service?#

  • Matlab support issue

    • Is this mis-aligned with our mission or values?

      • Erik: a wish for clarification of the goal, “integration with matlab” is vague

    • Should we offer this support to OpenScapes?

    • Estimates on implementation cost?

    • JC: How to handle licensing?

      • This was very difficult when they did this at PIMS

      • Instead they offered an Octave kernel inside a JupyterHub

      • Maybe this is what we should do here.

Discussion:

  • what does this mean?

  • SG: Binder supports Octave.

  • Costs associated with license management should be passed on to Openscapes

  • Berkeley runs a hub with Matlab

    • It is complex to run because of the licensing

    • It does help bring users on.

    • Ryan Lovett is the one that runs it.

  • What should this cost?

    • Premium for “non-open” special-casing

    • Premium for extra work we’d have to do to support this.

    • Premium for the uncertainty in doing this given we haven’t done it before.

  • YP: Don’t have an ideological problem with this plan, Matlab is just part of their image.

  • See this: https://github.com/mathworks/jupyter-matlab-proxy

  • ES: Who manages the Openscapes image? How does this incur additional work? SG: How to handle support tickets?

  • CH: One way to see how serious they are about this request is to make it cost a lot of money.

    • They should figure out the licensing costs

    • Not aligned with the open source vision

    • estimate a premium on extra special casing

    • uncertainty in this scenario so should be a premium cost

    • just charge given the other things we might prioritize higher.

  • DA: I am worried about the support piece of this plan.

  • YP: They can do what they want with their image. Key distinction: Openscapes can do it vs. 2i2c helps Openscapes do it.

  • [x] Jim to reply to the issue. Open tickets on support instead of inside the infrastructure repo. We can provide faster responses.

Support process updates#

  • PR here

  • Brief overview of process

  • Any concerns or feedback?

  • Any objections to trying this?

Action: Engineering team to review and provide feedback.

Metadocencia / Carpentries collab?#

Discussion:

  • JC: Callysto and Metadocencia have overlapping missions. 2i2c could catalyze collaboration among Callysto + Metadocencia + Carpentries. This has potential.

  • CH: a person was worried about commercial cloud use, how does 2i2c currently think about third-party clouds etc?

    • CH: Users worried about commercial clouds could come try at 2i2c, and by doing so, learn more about the deployment and then transition with the R2R.

    • SG: considers DigitalOcean as another cloud we could support

    • SG: partnership model involving training could be possible, to help them grow their capability of doing what 2i2c does in their own cloud

    • CH: a large amount of funding is potentially available

    • DA: supports that there is a need for non-commercial clouds

    • CH: asks DA to connect further

    • DA: is happy to connect

2022-04#

  • Facilitator: @georgiana

  • Notetaker: @choldgraf

Damian as a project manager (20min)#

  • Proposal for Damian to serve as a 50% service manager as a trial run.

  • More background here: https://github.com/2i2c-org/team-compass/issues/398

  • DA:

    • This isn’t going to be a huge change to our practices

    • Initially the focus is on the practices we’ve already discussed/agreed to.

    • We can evolve it over time as well

  • GD:

    • I think this is a great idea and thank you to Damian!

    • Concern: would this disrupt Damian’s workflow? How can we balance his 50% time?

      • We should keep track of this over time to make sure it’s sustainable for Damian

      • We should also be supportive of Damian in this role, and be on the lookout for whether it seems unsustainable for him

    • What’s the overlap with the Product/Community Lead?

      • Product/Community Lead is more external-facing, and the Project Manager is more internal/team-facing.

      • Project Manager should oversee the team “process” and make sure we’re following it.

Support steward and team roles discussion (30min)#

  • We need to adjust the frequency of these so they don’t overlap

  • Suggestion to rotate support steward based on timezone rather than alphabetical (ref: github issue)

  • Is it realistic to have a dedicated operations person

    • Proposal [Damián]

      • Since we have a 2 person team at support, let’s make one of them the level-1 and the other one is level-2.

      • You are level-1 in the first two weeks, level-2 in the next two weeks.

      • level-1 would be the client-facing (you only redirect things) and level-2 would be the dedicated operation eng (still with support of the whole team, of course).

    • Questions

      • How would we choose who is in which role?

      • If one person is dedicated to operations issues, would this mean they don’t have capacity for development work during that time?

        • To a degree, this is already true. The Support Steward tends to feel like they need to fix the problem as well as do support.

      • This may not solve the geography problem. If you are on a European time zone

        • We need more engineering capacity on the west coast ultimately

      • What if some team-mates cannot perform these actions? e.g., non-engineer team members?

      • What about expectation setting? Can we tell communities we won’t guarantee response in < 24 hours?

      • If we quantified and tracked our “service level expectations” then it could make it easier to reduce stress associated w/ outages, and set expectations with other communities.

        • We need a definition of “uptime”

        • What are the “Key Performance Indicators” for service levels?

          • Server start-up times

      • Could we define specific times where we are in “ready mode” for a hub, time-box this so it is not indefinite, and ask community representatives to choose when they request this extra attention?

      • This would add some extra complexity to manage these processes.

        • But, we are already paying this penalty in “implicit” ways.

      • Chris proposal: we treat first-line support and first-line operations as two separate teams.

        • General agreement on this.

      • But still haven’t figured out the timezones issue

        • We really only care about the “major issues” when it comes to timezones. Most things do not need to be fixed immediately.

        • This is where a pager can be helpful.

      • Incident Commander is a concept from SRE

        • When there is an outage, this person makes sure that it is resolved.

        • Used in “must be fixed immediately” situations.

        • Something like PagerDuty is a good tool to try out.

    • What are the 2-3 things we can do next to improve things?

      • Look at our support tickets so far, and define a “hierarchy” of ticket types / priority queue, and define an escalation pathway for each.

        • After we do this, next steps can be working out the details of how we actually address those priority levels.

      • Should we also prioritize monitoring infrastructure?

        • Yes, general agreement that this would be helpful.

        • Defining a metric for service reliability could be a part of this.
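A minimal sketch of the level-1 / level-2 rotation proposed above, assuming a staggered schedule where each person spends one two-week period as level-1 and the following two-week period as level-2. The function name `support_roles` and the team member names are hypothetical placeholders, not an agreed process.

```python
from typing import List, Tuple


def support_roles(team: List[str], period: int) -> Tuple[str, str]:
    """Return (level_1, level_2) for a given two-week period (0-indexed).

    The person who was level-1 in the previous period becomes level-2,
    so everybody serves two weeks client-facing, then two weeks on operations.
    """
    if len(team) < 2:
        raise ValueError("need at least two people to fill both roles")
    level_1 = team[period % len(team)]
    level_2 = team[(period - 1) % len(team)]
    return level_1, level_2


# Hypothetical team; real names and ordering (e.g. by timezone) are up to us.
team = ["engineer-a", "engineer-b", "engineer-c"]
for period in range(4):
    level_1, level_2 = support_roles(team, period)
    print(f"period {period}: level-1={level_1}, level-2={level_2}")
```

Ordering the `team` list by timezone instead of alphabetically would be one way to fold in the suggestion about timezone-based rotation.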

Team meeting length (10min)#

  • Is 90 minutes too long for our monthly team meeting?

    • Proposal [Damián]:

      • Let’s have a monthly enjoy/informal meeting (30m, NO tech stuff, opt-in, open agenda)

      • Let’s have a monthly engineering meeting (1h, GENERAL tech stuff, all hands, ONLY agenda)

    • General agreement that this would be nice.

    • Could we block off the 90 minutes but still try to keep our meetings to 60 minutes?

      • Kirstie’s lab does this and it works nicely.

    • Proposal:

      • Team meetings are 60 minutes, but reserve a time slot for 90 minutes, and use the last 30 minutes as a “break glass in case of emergency” time to discuss.

      • Also find a way to make team space for informal conversation and chats with one another, but not as a part of the monthly team meeting.

Team meeting agenda coordination (5min)#

  • Our current agenda is split between a Google Calendar, a GitHub issue, and this HackMD.

  • We’ve hit confusion about syncing information across them etc.

  • Let’s simplify with this structure:

    • Do not use a dedicated issue for these meetings. Remove the issue template for team meetings.

    • The 2i2c Team Calendar is the source of truth for this meeting. We expect team members to keep up-to-date w/ this calendar to know when meetings will happen.

    • Meeting calendar events have one link, which is to the meeting agenda/notes.

    • In the agenda/notes, we keep all relevant information about the meeting (links, roles, etc…but not dates).

  • General agreement on this!

Will the Project Manager lead the sprint planning meetings now?#

  • Yes - the Project Manager can lead this meeting.

  • The “meeting facilitator” can serve in a support role for this meeting (not sure exactly what this means but we can figure it out).

2022-03#

  • Meeting facilitator: @erik

  • Meeting timekeeper: @damian

  • Meeting recorder: @erik

Quick updates#

Hub service growth plan#

  • Currently serving about 20-25 communities.

  • Large variability in size. Some as small as 2-3 (Catalyst Cooperative), some as large as 200-300 (Toronto)

  • Large variability in complexity. Some have vanilla data8 environments (community colleges), others have monumental Dockerfiles (Pangeo)

  • As we grow, we also create more work for ourselves.

  • At the end of our runway (say, 24 months), we need to have a sustainable model.

  • What’s the right way to intentionally grow so that we can refine our infrastructure and processes?

Proposal#

  • Phases of growth with fixed number of “slots” for communities in each phase.

  • Three major areas that we are working out

    • Infrastructure / tech

    • Process for operations and support

    • Sales and invoicing

  • For each area, define questions that we want to answer, or conditions that we want to meet, in order to trigger a new batch of communities that we’ll take on.

  • Write up a blog post about this plan and share it for feedback, even if we don’t intend on taking on any new communities for some time.

Questions to ask#

What should our target number of communities be? (for now, in 1 month, in 3 months?)

  • Currently we have around 15 basic hubs, 7 research hubs, 3 experimental hubs

  • We need different kinds of slots to help indicate the amount of work they involve:

    • The amount of usage

    • Customization to authenticators

    • Image complexity: entirely custom image, repo2docker based image

    • Hub adjustments: sidecar, gpu

    • Initial development efforts

    • Community knowledge: they offload us by avoiding issues in the first place, and being able to understand the complexity of what they need

What is in scope, what isn’t?

  • More standardized hubs -> we can support more communities. We need to have a grasp of what we aim for with regards to this.

    • Avoid one off features

  • Answer 1: Projecting into the future is one way to answer

    • Sarah: very few new communities right now, because we are to handle a new hub type: binderhubs

What is our biggest source of risk?

If there were one thing you’d focus on to boost our capacity, what would it be?

  • What can constrain our growth?

    • unreliability in automation to deploy, it is a timesink and can also break your focus repeatedly while waiting

    • we have made experimental exceptions on what hubs should accomplish

      • Overviewing the experimental adjustments as hubs grow is a challenge

      • While we need to experiment to meet users’ needs, we need to work to bring experimental features back into general features, or have a practice for managing the experimental features.

      • Some hubs tend to require more work than others, how do we handle such differences?

    • An engineer needs to gauge the complexity of implementing new features and make a call if it is seen as sustainable to maintain.

    • We can be constrained by features that only a subset of us understand how to maintain:

      • “will this feature be something everybody needs to learn about in the end?”

      • We don’t want to constrain ourselves to the lowest common denominator of shared knowledge

    • lack of monitoring

Goals for the alpha#

The goal is a phased approach to growing. For that, we need to think more clearly about our growth goals and define them more explicitly.

Idea: to have a concept of capacity, and how we increase that over time.
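One way to make the "capacity" and "slots" idea concrete is to give each kind of hub a relative effort weight and compare the total against a capacity for the current phase. This is only a sketch: the weights in `SLOT_WEIGHTS` and the value of `phase_capacity` are hypothetical and were not discussed in the meeting; the hub counts are the approximate ones quoted above.

```python
from typing import Dict

# Hypothetical relative effort per hub kind. These weights are NOT from the
# notes and would need to be calibrated by the team.
SLOT_WEIGHTS: Dict[str, int] = {"basic": 1, "research": 3, "experimental": 5}


def current_load(hub_counts: Dict[str, int]) -> int:
    """Total effort units consumed by the hubs we already run."""
    return sum(SLOT_WEIGHTS[kind] * count for kind, count in hub_counts.items())


def open_slots(capacity: int, hub_counts: Dict[str, int], kind: str) -> int:
    """How many more hubs of `kind` fit before we exceed `capacity`."""
    remaining = max(0, capacity - current_load(hub_counts))
    return remaining // SLOT_WEIGHTS[kind]


# Approximate hub counts quoted in the notes above.
hubs = {"basic": 15, "research": 7, "experimental": 3}
phase_capacity = 60  # hypothetical capacity for the current growth phase

print("current load:", current_load(hubs))
for kind in SLOT_WEIGHTS:
    print(kind, "slots open:", open_slots(phase_capacity, hubs, kind))
```

Raising `phase_capacity` when a phase's conditions are met would then correspond to opening a new batch of slots.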

How to improve team coordination or management#

  • Currently, we’re using our weekly check-ins + sprint planning + async issues to coordinate.

  • There are a few kinds of work that the engineering team is doing.

    • Development projects (often span multiple weeks, and self-driven)

    • Expected operations tasks (e.g., expected maintenance, setting up a hub, etc)

    • Operations and support tasks (aka, helping when something is wrong, more unpredictable and come with more urgency)

  • The last two things are tricky - they are more unpredictable, but also more important that we do them quickly and on schedule.

    • This is particularly important when interfacing with sales, since a sales person will make a commitment to running the hub by a certain date.

  • Ultimately, we need to be able to:

    • Give somebody a concrete date (or guaranteed turnaround time) when their issue is resolved / hub is deployed / etc.

    • Be confident in our team processes that we will definitely get it done by that date, or raise a flag to change plan.

  • Our current model assumes that Chris is keeping the context of what everybody is working on in his head, but this is not scalable.

    • Chris feels like he is dropping balls in guiding others, refining issues, etc.

    • Also hard because he doesn’t have the skillset of a cloud engineer.

Questions#

A few questions and ideas:

  • How can we improve our team processes to reduce the uncertainty around finishing support / operations issues?

    • Discussion:

      • It is hard to quickly turn our incoming requests into action points we can finish. It’s often a back and forth.

      • First line of support, second line of support

        • First line are generalists, they answer questions coming from all areas and help elicit relevant background information for the specialists

        • Second line of support are specialists

    • Answers:

      • The support steward acts as the first line of support

      • Training for first line of support!

        • Digital educational content?

      • Defining mechanisms to ensure support tickets get resolved, or perhaps having a dedicated role to drive operations

Follow-up items#

2022-02#

New job posting for product/community lead#

  • ref: https://github.com/2i2c-org/meta/issues/256

  • Role will be focused on maximizing the impact for communities using our infrastructure

  • A combination of community impact, training, and helping guide our service development

  • They should focus on the infrastructure from the user’s perspective, not an engineer’s, but work with engineering

  • Intro discussion:

    • CH: Background consideration: What major things would help any community? Currently, we don’t have capacity to communicate more directly to help the community get the most value out of the hubs. One part of this role is to spearhead the kind of strategy and efforts to help communities use the infra in the most effective way. Another part is to be a more direct contact person. In this way, the person would also help guide future development efforts by acquiring insights about the user experience.

    • CH: Skills relevant for the role:

      • Great communicator, good at teaching, enthusiastic.

      • Capable of leading the development of the role in itself

  • Questions:

    • What do others think about the scope of this role? Does it make sense?

      • ES: At Sandvik, it would have been helpful to have somebody dedicating time to teaching others how to make use of the service.

        • Collecting user experience data quickly rather than slowly can be critical.

        • “They’d be in a unique position to validate the value of certain features”.

        • Could this role write feature proposals for open source communities?

      • GD: Could help build trust with communities, by having a more human connection

    • How can we incorporate this role into our team planning processes?

      • Feature requests and shaping for our communities

      • They will need to talk to engineers to know what is easy vs. hard vs. impossible to implement

      • Could we set up a process to plan/prioritize our next work?

        • If so, this person could be involved in this discussion

        • They will have a good perspective on what to prioritize

Does needing a specific datacenter require a dedicated cluster?#

When communities want a hub to run in a particular datacenter, does this mean we must give them their own cluster? (10m)

  • ref: https://2i2c.slack.com/archives/C0206KLF76E/p1643677108490539

  • CH: A user wants a hub in a specific data center, but doesn’t need a dedicated cluster.

    • SG: This is probably to a large extent related to egress/ingress costs. We will need a new cluster in this case, but should we make it shared or dedicated to them?

    • SG: Of course, a shared cluster would imply that we make sure credentials for accessing relevant resources don’t leak between communities.

    • ES/CH/Yuvi: What about cost of egress and billing? A dedicated cluster simplifies billing, because we can’t separate costs related to egress traffic within a shared cluster.

    • YP: Networking performance gains to co-locate with data.

    • CH: If we were launching into separate clusters w/ the multi-cluster spawner, would that create a really huge list of user profiles people had to choose between?

      • SG: You can restrict the access that individuals have to different clusters, may also be able to restrict visibility

      • YP: Cluster-spawner would reduce complexity of the spawned clusters

      • CH: Want to avoid this: shared hub, 20 clusters to select from, 19 are totally irrelevant…

      • ES: Off-topic but relevant idea: provide ping times

      • ES: What are limitations for users of using a central cluster spawner to access a hub?

2022-01-04#

  • Meeting facilitator: @choldgraf

  • Meeting timekeeper: @choldgraf

  • Meeting recorder: @choldgraf

OpenScapes outage#

  • We might be thinking about this when it comes to the pangeo Binder

  • Would testing fix this?

    • In some cases, maybe, but not always

    • For hubs that want to have full access, testing would not be a fix

  • Scenario where somebody wants to have full access

    • A time-boxed workshop where you don’t want to hassle with manually adding a bunch of people

  • Could email reporting address this?

    • Having a single grafana that aggregates across the prometheus for each hub would help

    • Not clear how you’d be able to access

  • Could RBAC address this?

    • RBAC maps onto specific permissions in the hub, it’s a bit more complex for arbitrary permissions mapping

  • About more convenient authorization logic, an issue about a hook: https://github.com/jupyterhub/jupyterhub/issues/3490

  • Other project ideas

    • Erik: interested in Helm validation, but wants more consensus on what our helm config structure should be first.

    • Sarah: Standardize our Terraform structure across providers so that we minimize complexity associated with provider-specific stuff.

      • Figuring out a workaround for ConfigConnector. Scratch buckets + requester pays access are two things that’d be enabled by this.

    • Chris: Help with migrating the Pangeo documentation. Can coordinate with Sarah to figure out which ones

  • Damian: Improving our staging and testing environments will make it easier to iterate and develop new things - this is also related to the Helm validation stuff that we discussed.