# | Governance as Code Theme | Proposed Hack | Looking for Teammates | DataHub Slack Handle |
---|---|---|---|---|
2 | Active Metadata: supercharging data governance practices with streaming metadata | Can we leverage Actions to facilitate communication between owners and consumers of data directly through DataHub? End state: 1. Give users (and groups) the ability to "Subscribe" to a dataset (with varying levels of alerting). 2. Give owners of data the ability to communicate with users and groups: a. a new tab where owners can post messages; b. messages are consumed by the Actions framework; c. the AWS email client (or Slack) is called to send the message to all subscribers. Types of messages could include: table deprecation, errors with data, updates to data (a kind of changelog), etc. (A minimal notification sketch appears below the table.) | Yes | @kartik.darapuneni |
3 | Data Products and Data Mesh: governing the creation of data in schema-first ecosystems and beyond | Proactive governance. Once a dataset's usage reaches a set volume (from dataset statistics), check whether its description and field descriptions are set. If they are not, email/message the owner that the dataset is now in the company's "critical path" and that a set of defined governance rules is not met. | Yes | @kartik.darapuneni |
4 | Active Metadata: Supercharging data governance practices with metadata change events | Keeping metadata descriptions and documentation up to date is always hard. In addition, some data owners need help from business areas to describe some datasets and dashboards, and for those areas that is the last priority. So, to give business data owners a little "push" to keep metadata descriptions fresh, blocking out-of-date datasets from being refreshed would be a good motivation. To achieve that, the orchestration tool (such as Airflow) would have an Operator that analyzes the percentage of described columns for each dataset; any dataset under 80% would have its refresh job (or DAG) blocked so it wouldn't run. Of course, this is a bit much, so Slack alerts could warn the data owners about these below-80-percent-described datasets for 3 or 4 days before blocking the DAG itself. (A minimal operator sketch appears below the table.) | No | Rodolfo Shideki |
5 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Ingest business glossaries from SKOS vocabularies. The Dutch government publishes a lot of glossaries as SKOS vocabularies (e.g. https://definities.geostandaarden.nl/imgeo/nl/). I'd like to come up with an ingest recipe that imports the relevant business glossaries into DataHub so we can label our data products according to national standards. (A minimal conversion sketch appears below the table.) | Maybe | Niels Hoffmann |
6 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Add functionality to a postgres ingestion script to match tables/columns to business glossary terms. For at least some of our postgres datasets we have SHACL based ontologies that refer to SKOS vocabularies for definitions. I'd like to come up with an ingestion script that uses our SHACL ontologies to map ingested data to business glossary terms. | Maybe | Niels Hoffmann |
7 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | If we can govern Hive schemas in code rather than in the Hive Metastore (HMS), the benefits are: 1. Users can add customized metadata on datasets and columns to represent business logic. 2. Users get version control for free. 3. Customized validation on metadata before the code is checked in. So we are going to let users codegen tables in HMS as files in a repo, then add metadata, validate, and check in the code. At CI/CD time, ingest the files into DataHub. | No | Mingzhi Ye |
8 | Active Metadata: Supercharging data governance practices with metadata change events | Orchestration of metadata across platforms - create a federated metadata management system that makes these different tools and systems talk to each other, thus making data assets across these systems interoperable. | No | MetOrche |
9 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Business metric tracking in real time and fraudulent transaction detection. We will use machine learning and benchmarking to measure and optimize the efficiency of business processes. We will use XGBoost for fraud detection, based on various parameters of a transaction. | No | |
10 | Active Metadata: Supercharging data governance practices with metadata change events | "Leveraging Active Metadata to Tackle Security Concerns while using File Datastores like Google Drive" - Plan of action: we're planning to build an add-on for Google Drive which will evaluate and scan every file's active metadata before it gets uploaded to the drive. This would act as an added security layer for file sharing. The add-on will be available via the Google Workspace Marketplace. - Technologies: JavaScript, jQuery | No | Karan Agrawal |
11 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | We plan to build an ML model that can predict the success of newly launched products in the market. To define the success of a product, an important factor to consider is the difference between business logic and technical data. To evaluate business performance it is necessary to start with the product and its future goal. Once this is defined, a product metric should be assigned to evaluate success. Product metrics can vary: number of users, attractive features, click-through rate of recommendations, etc. Business performance can at times look unsuccessful because of the acquired technical datasets, but it is important to realise which metrics need to be optimised. That is the main goal of our project: optimise the metrics needed to get precise results for a product that can stand firm in the market. | No | Prachi Nandi |
13 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Companies that want to adopt DataHub as a critical component of a self-service analytics culture need metrics that help Data Governance teams measure their work. The Analytics dashboard only provides metrics that show platform usage, not metadata coverage. Our goal is therefore to increase the amount of information provided on the Analytics Dashboard page to give more detail about metadata coverage. Through this, we seek to answer questions such as: * How many tables/views have descriptions? * How many tables/views have defined owners? * How many data assets are there per tag? * How many ML models have technical owners defined? * How many tables are failing their data quality tests? With this, we expect that data teams can better prioritize their work and fill in the metadata to drive more platform usage, allowing users to self-serve. | No | @Vinícius Mello |
14 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Planning to develop a user-facing application for restaurant owners to help them forecast the KPIs and metrics of their business, so they can understand the demographics behind their sales. What are we going to do? We would develop a UI where restaurant owners can add the details of any existing or new customer of their restaurant. Details include: name, email, phone number, location, servicing waiter, servicing chef, ordered food. By collecting these details for each customer, the restaurant owner would be able to map everything about that customer. What next, what are we going to do with the data? - Provide analysis of the quantity of each food item sold by day, time, and type of audience - Predict the most active hours of the restaurant - Predict what type of customers are coming to a restaurant (i.e. student, couple, family) so restaurants can be more customer-centric - Predict the ratio of online to offline sales (to help restaurants know which mode is better for them) - Predict the least busy day for a restaurant - Predict the locations most customers come from - Analyze the difference between food produced and consumed - Analyze the working hours of each staff member. All of this statistical analysis would be presented as graphs, helping bind the technical data to better product-oriented strategies for restaurants so they know more about their business. Moreover, in a future version of the product we plan to automate things by adding a special QR feature for any order placed in the restaurant, reducing the owner's work and making the platform more customer-oriented as well as time-saving for owners. Tech stack: front-end: React JS and Tailwind CSS; back-end: Firebase | No | Dhruv Trehan |
15 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Poppin' bottles, chasing costly Snowflake dbt models. Problem: dbt models are easy to create, but also easy to make expensive. As a platform owner, you need to keep track of your costs. Solution: for each dbt model, I will calculate the cost and expose it on the dataset/model page. This should be possible using the query logs, connecting each query to its respective model; the result then needs to be ingested into DataHub somehow. If possible it would be exposed as a dashboard over time, potentially showing different kinds of insights. I, however, do not know much about DataHub; we do not use it at my company today. I have only been following the project for some time and I hope to be able to contribute this feature somehow. (A minimal cost-attribution sketch appears below the table.) | Maybe | @Kevin Neville |
16 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Don't know. I know a little about Apache Atlas & Amundsen | Maybe | Rock Pereira |
17 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Helping deaf and mute people. My project will translate letters and words into sign language so that they can learn. Trying to deploy it as an app in the future. | No | Fahad |
18 | Active Metadata: Supercharging data governance practices with metadata change events | My hack is to build a metadata management tool. Metadata management is a cross-organizational agreement on how to define informational assets. The importance of metadata: metadata maximizes the value of information assets through the necessary context and the consistent use of terminology; it helps you discover, classify, describe, archive, control and manage data. Features of an effective metadata management tool: effective metadata management provides rich context to enterprise data assets, powering their efficient use and reuse. It is also a requirement for government agencies across the world to ensure data governance. According to the 2020 Dataversity Report on Trends in Data Management, many organizations place increasing value on data governance and metadata management to build a data landscape that informs about all the data assets in the organization. 1. Define a metadata strategy: metadata management begins with defining a metadata strategy to provide a foundation for assessing the long-term value. Gartner Research recommends that metadata management initiatives clearly support the organization's business vision and resulting business objectives. Some critical questions to ask to arrive at a metadata strategy are: For what type of information do we want to manage metadata, now and in the future? For what type of information are we not able to manage metadata? What are our current and prospective use cases for metadata? What are our use cases for showing relationships between metadata objects? 2. Establish scope and ownership: identifying the scope is always a best practice to channel the efforts in the right direction and focus on high-priority activities. Identify your potential use cases for metadata – data governance, data quality, data analysis, risk and compliance are some of the top use cases. Capture functional metadata requirements – examples include resource discovery, digital identification, or archiving. Specify all possible ways we will use the proposed solution – metadata capture and storage, integration, publication and more. Define the scope considering essential requirements and critical use cases. Defining clear ownership and responsibilities ensures accountability for metadata quality, and well-articulated roles will help you optimize resource utilization. Critical data is no more than 10 to 20 per cent of total data in most organizations; prioritize data assets and focus metadata leadership accordingly. 3. Adopt metadata standards: the recently published DoD Data Strategy emphasizes metadata tagging and common metadata standards for data-centric organizations. Common metadata standards assure uniform usage and interpretation across your vendor and customer communities. Metadata standards have been evolving over the years and they vary in levels of detail and complexity. General metadata standards like the Dublin Core Metadata Element Set apply to broader communities and make your data more interoperable with other standards, while subject-specific metadata standards help search data more easily; for example, the ISO 19115 standard works well for the geospatial community. We can evaluate which standards align best with our use cases and communities. Metadata management is much more than a one-time activity: define and implement metadata lifecycle management so that we are ready for changes, enhancements and identifying areas of improvement. | No | Ayushman Singh Chauhan |
19 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Auto-annotate DataHub entities by learning from user annotations. Abstract: today, DataHub users must manually add glossary terms to schema fields in the DataHub UI. These user annotations can be learnt from to automate glossary annotation of new entities in DataHub, saving users multiple hours per week/month by avoiding explicit field-by-field annotation. Proposal: we will train ML models to learn user annotations on a subset of DataHub data. Users can train multiple ML models for different subsets within DataHub. These trained models will later be applied to a selected subset to automate the glossary annotation. The predictions will be treated as inferences until users accept/reject them. (A minimal classifier sketch appears below the table.) | No | Chanakya Reddy, Rajat Gupta, Shrey Batra |
20 | Data Products and Data Mesh: Creating a schema-first ecosystem organized around domains | The schemas used as data contracts are often shared among multiple data stores. We have a use case of Avro schemas that specify the data on both Kafka and Hive. Currently DataHub can fetch Avro schemas from the Schema Registry through Kafka, but in the general sense the Schema Registry and the Avro schemas are entities independent from Kafka topics. In the solution we propose, we would like to parse Avro schemas directly from the Schema Registry and then relate these schemas to Kafka topics and Hive tables (the latter might be a bit trickier). This will better represent the real relations between the schemas and the data stores, and it decouples the Kafka and Schema Registry ingestions. (A minimal Schema Registry sketch appears below the table.) | No | Mert Tunc |
21 | Active Metadata: Supercharging data governance practices with metadata change events | 1 - Analysis of the security vulnerabilities of the MAVLink protocol. 2 - Based on the vulnerabilities of the MAVLink protocol, propose an attack methodology that can disable an ongoing UAV mission (ethical hacking); this attack can be empirically validated on a dedicated platform. 3 - Development of a cyber attack detection algorithm. 4 - Implementation of different encryption algorithms (i.e. AES-CBC, AES-CTR, RC4 and ChaCha20) in order to secure MAVLink communication against attacks. | Maybe | no |
22 | Active Metadata: Supercharging data governance practices with metadata change events | Enable DataHub users and groups to leave marks and explanatory comments on datasets and any dataset-related fields, aspects, documentation, structure, etc., when they see or think there is an inconsistency between the actual dataset and its representation in DataHub, so that the owners of the dataset can investigate and resolve it. | No | Kerem Kahramanoğulları |
23 | Data Products and Data Mesh: Creating a schema-first ecosystem organized around domains | We propose to add single-click workflows to create data products in a data tool from a Dataset's page in Datahub. When a user is viewing a Dataset in Datahub, they may wish to view data in a data analytics tool and run queries without having to preconfigure the tool. We aim to create a button on the Dataset Entity that will redirect users to the data tool and preload it with the Dataset's information. This will allow the user to quickly access and explore the data, simplifying the user experience and creating a data product. | No | Edward Morgan (Raft) |
24 | Data Products and Data Mesh: Creating a schema-first ecosystem organized around domains | Do you remember when, signing up for classes in college, you had to fight for a seat in the most popular classes during the signup window? When you didn't know which class to take next after completing a generic prerequisite class? When you had to go to different websites to see professor ratings, class grading scales, and historical grade distribution for one course? All of these headaches are caused by typically poor-quality, school-provided course catalogs which either lack information you want to see or are hard to navigate because they're based on 25-year-old BASIC. What is the solution to this? Data Of Course! Data Of Course is a better course catalog which takes advantage of DataHub to help students make better decisions about which classes they want to take. With the beautiful and simple-to-use UI that DataHub provides, students are able to clearly visualize all of the information they want about a class, thanks to the well-defined models/schemas we will provide. They will not only be able to see information about the class all in one place, but also about its professor(s), TAs, and more. | No | Justin Donn |
25 | Active Metadata: Supercharging data governance practices with metadata change events | Compliance (dataset field tag) annotations in the PDL schema. Data producers are the domain experts who decide whether certain fields of a dataset contain sensitive information. Currently, this annotation of field tags is done through the DataHub UI after the fact, in the sense that the data schema is defined, data is produced and maybe even leveraged by other downstreams first. For better and tighter data governance, this annotation should be done early, ideally at the time of schema definition. We propose to leverage custom PDL annotations to annotate/tag fields in a PDL schema and make these tags immediately available in DataHub even before the data is available. This will help prevent metadata drift. Although this is being proposed for PDL schemas, the idea could well be extended to other schemas: Avro, Protobuf, etc. | No | Kerem Sahin (LinkedIn) |
26 | Active Metadata: Supercharging data governance practices with metadata change events | Adding an upstream/downstream lineage dependency to a dataset with a single click of a button. Currently, in DataHub, there is no way to add a manual upstream/downstream lineage dependency other than ingestion through code. There may be situations where the lineage information cannot be captured automatically through metadata ingestion and the users need to connect two different datasets. The solution can be achieved by creating a new UI button on the lineage page (with/without visualization) of the dataset. After clicking the button: 1 - the user searches for and chooses which dataset to connect from the list of datasets; 2 - then chooses one of the dependency options (upstream or downstream). From this point on, GMS will take care of ingesting the lineage information. P.S.: I see that only the UPSERT ChangeTypeClass is supported, but we would need to somehow enable CREATE, UPDATE and DELETE options as well in order to achieve the solution. (A minimal lineage-emission sketch appears below the table.) | No | Salih Can |
27 | Business Glossaries, KPIs, and Metrics: governing the intersection of technical data + business logic through code | Field-based search across all types of data using regex, for better management and tracking. We are planning to incorporate this feature in DataHub itself to make it more accessible. It will improve performance, save money, allow old data to be reused for new applications, and keep track of sensitive and legal data. Let's say we're looking for *csam: the search would then return any tables, files, and other types of data that we store whose fields match the pattern. (A minimal search-query sketch appears below the table.) | No | Avi Garg (LinkedIn) |
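
Sketch for row 2 (subscriber notifications). A minimal sketch of the notification step only, assuming metadata change events have already been decoded to Python dicts (in practice the DataHub Actions framework or an Avro-aware Kafka consumer would do that), and assuming a Slack incoming webhook plus a hypothetical `SUBSCRIPTIONS` store, since subscriptions do not exist in DataHub today.

```python
# Hypothetical sketch: notify dataset subscribers when a metadata change event
# arrives. SUBSCRIPTIONS and SLACK_WEBHOOK_URL are assumptions, not DataHub features.
import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # Slack "incoming webhook"

# Hypothetical subscription store: dataset URN -> Slack handles of subscribers.
SUBSCRIPTIONS = {
    "urn:li:dataset:(urn:li:dataPlatform:hive,fct_orders,PROD)": ["@kartik", "@data-consumers"],
}


def notify_subscribers(event: dict) -> None:
    """Post a Slack message to everyone subscribed to the changed dataset."""
    urn = event.get("entityUrn", "")
    aspect = event.get("aspectName", "")
    subscribers = SUBSCRIPTIONS.get(urn)
    if not subscribers:
        return
    # Only some aspect changes are interesting to consumers (deprecation,
    # schema changes, documentation edits); everything else is noise.
    if aspect not in {"deprecation", "schemaMetadata", "editableDatasetProperties"}:
        return
    text = f"{' '.join(subscribers)}: `{urn}` changed (aspect: {aspect})."
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
```

Swapping the Slack webhook for an email client (e.g. AWS SES) only changes the last call; the subscription lookup stays the same.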
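Sketch for row 4 (description-coverage gate). A minimal Airflow operator sketch that fails, and therefore blocks downstream refresh tasks, when column description coverage falls below a threshold. The GraphQL query shape is an assumption based on DataHub's public API and may differ between versions; add an Authorization header if metadata service authentication is enabled.

```python
# Hypothetical sketch: block a refresh DAG when documentation coverage is low.
import requests
from airflow.models.baseoperator import BaseOperator

COVERAGE_QUERY = """
query docCoverage($urn: String!) {
  dataset(urn: $urn) {
    schemaMetadata(version: 0) {
      fields { fieldPath description }
    }
  }
}
"""


class DescriptionCoverageGate(BaseOperator):
    def __init__(self, dataset_urn: str, gms_url: str, min_coverage: float = 0.8, **kwargs):
        super().__init__(**kwargs)
        self.dataset_urn = dataset_urn
        self.gms_url = gms_url          # e.g. http://datahub-gms:8080
        self.min_coverage = min_coverage

    def execute(self, context):
        resp = requests.post(
            f"{self.gms_url}/api/graphql",
            json={"query": COVERAGE_QUERY, "variables": {"urn": self.dataset_urn}},
            timeout=30,
        )
        resp.raise_for_status()
        fields = resp.json()["data"]["dataset"]["schemaMetadata"]["fields"]
        described = sum(1 for f in fields if (f.get("description") or "").strip())
        coverage = described / len(fields) if fields else 0.0
        self.log.info("Description coverage for %s: %.0f%%", self.dataset_urn, coverage * 100)
        if coverage < self.min_coverage:
            # Failing here blocks the downstream refresh; the softer variant in the
            # proposal would send Slack warnings for a few days before failing.
            raise ValueError(f"Coverage {coverage:.0%} below {self.min_coverage:.0%}")
        return coverage
```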
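Sketch for row 5 (SKOS import). A minimal sketch that converts a SKOS vocabulary into the YAML consumed by DataHub's business-glossary ingestion source. The YAML keys follow the documented business-glossary file format but should be treated as an assumption and checked against your DataHub version; the owner and node name are placeholders.

```python
# Hypothetical sketch: SKOS vocabulary -> DataHub business glossary YAML.
import sys
import yaml
from rdflib import Graph
from rdflib.namespace import RDF, SKOS


def skos_to_glossary(url: str, node_name: str) -> dict:
    g = Graph()
    g.parse(url)  # rdflib guesses the serialization (RDF/XML, Turtle, ...)
    terms = []
    for concept in g.subjects(RDF.type, SKOS.Concept):
        label = g.value(concept, SKOS.prefLabel)
        definition = g.value(concept, SKOS.definition)
        if label is None:
            continue
        terms.append(
            {
                "name": str(label),
                "description": str(definition or ""),
                "term_source": "EXTERNAL",
                "source_ref": node_name,
                "source_url": str(concept),
            }
        )
    return {
        "version": 1,
        "source": "DataHub",
        "owners": {"users": ["niels"]},  # placeholder owner
        "nodes": [{"name": node_name, "description": f"Imported from {url}", "terms": terms}],
    }


if __name__ == "__main__":
    vocab_url = sys.argv[1]  # e.g. a geostandaarden.nl SKOS download
    print(yaml.safe_dump(skos_to_glossary(vocab_url, "IMGeo"), allow_unicode=True))
```

The generated file can then be fed to the `datahub-business-glossary` source in a normal ingestion recipe.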
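Sketch for row 15 (dbt model cost). A rough sketch of the cost-attribution step, assuming each dbt model sets its name as the Snowflake query_tag (dbt-snowflake supports a `query_tag` config) and approximating credits as execution time times warehouse size, which is only a proxy for real billing, not an exact cost.

```python
# Hypothetical sketch: approximate per-dbt-model Snowflake credits from query history.
import snowflake.connector

# Approximate credits per hour by warehouse size.
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16}

COST_SQL = """
select query_tag,
       warehouse_size,
       sum(total_elapsed_time) / 1000 / 3600.0 as hours
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -30, current_timestamp())
  and query_tag <> ''
group by 1, 2
"""


def model_costs(conn) -> dict:
    costs: dict = {}
    for tag, size, hours in conn.cursor().execute(COST_SQL):
        credits = (hours or 0) * CREDITS_PER_HOUR.get(size, 1)
        costs[tag] = costs.get(tag, 0) + credits
    return costs


if __name__ == "__main__":
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...", role="ACCOUNTADMIN"
    )
    # These per-model estimates could then be pushed to DataHub (e.g. as dataset
    # properties) and charted over time, as the proposal suggests.
    for model, credits in sorted(model_costs(conn).items(), key=lambda kv: -kv[1]):
        print(f"{model}: ~{credits:.2f} credits / 30 days")
```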
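Sketch for row 19 (glossary term suggestions). A minimal sketch of the learning step: existing annotations (field path plus description as text, attached glossary term as label) train a classifier that proposes terms for unannotated fields. The training rows below are made-up examples; real data would be exported from DataHub.

```python
# Hypothetical sketch: suggest glossary terms for schema fields from past annotations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (field path + description) -> glossary term, as already annotated by users.
annotated = [
    ("customer_email contact email of the buyer", "Contact.Email"),
    ("billing_address street and city for invoicing", "Contact.PostalAddress"),
    ("order_total gross amount of the order in EUR", "Finance.Amount"),
    ("shipping_address delivery street and city", "Contact.PostalAddress"),
]
texts, labels = zip(*annotated)

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # robust to naming styles
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Suggestions for new, unannotated fields; per the proposal these stay
# "inferences" until a user accepts or rejects them.
for field in ["buyer_mail", "invoice_amount_eur"]:
    probs = model.predict_proba([field])[0]
    best = model.classes_[probs.argmax()]
    print(f"{field}: suggest {best} (confidence {probs.max():.2f})")
```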
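Sketch for row 20 (Schema Registry as its own source). A minimal sketch that pulls Avro schemas directly from a Confluent Schema Registry and derives the Kafka topic each subject describes, assuming the default TopicNameStrategy (`<topic>-key` / `<topic>-value` subjects). Relating the same schemas to Hive tables would need an extra, separately maintained mapping.

```python
# Hypothetical sketch: list Avro schemas from the Schema Registry and map them to topics.
import json
import requests

REGISTRY_URL = "http://localhost:8081"  # Schema Registry REST endpoint


def list_schemas(registry_url: str):
    subjects = requests.get(f"{registry_url}/subjects", timeout=10).json()
    for subject in subjects:
        latest = requests.get(
            f"{registry_url}/subjects/{subject}/versions/latest", timeout=10
        ).json()
        schema = json.loads(latest["schema"])  # the Avro schema itself
        topic = subject.removesuffix("-value").removesuffix("-key")
        yield topic, subject, schema


if __name__ == "__main__":
    for topic, subject, schema in list_schemas(REGISTRY_URL):
        fields = [f["name"] for f in schema.get("fields", [])]
        # Each (schema, topic) pair could be emitted to DataHub as a dataset plus a
        # relationship, instead of relying on the Kafka ingestion source alone.
        print(f"{subject} -> topic '{topic}': {fields}")
```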
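Sketch for row 26 (one-click lineage). A minimal sketch of the ingestion side of the idea using the DataHub Python emitter (acryl-datahub); the proposed UI button would ultimately trigger something equivalent to this call. URNs and the GMS address are placeholders.

```python
# Hypothetical sketch: emit an upstream lineage edge between two datasets.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

upstream_urn = make_dataset_urn(platform="hive", name="db.raw_orders", env="PROD")
downstream_urn = make_dataset_urn(platform="hive", name="db.fct_orders", env="PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
)

# Note: this upserts the whole upstreamLineage aspect, which matches the
# UPSERT-only limitation mentioned in the proposal; a true "add one edge" flow
# would first read the existing aspect and append to it.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage))
```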
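Sketch for row 27 (wildcard search). A minimal sketch issuing a wildcard search through DataHub's GraphQL API. The query shape follows the public search API but the exact fields should be treated as an assumption; true regex matching (beyond `*` wildcards) would still need backend work on the search index.

```python
# Hypothetical sketch: wildcard search for datasets whose names match a pattern.
import requests

SEARCH_QUERY = """
query wildcard($q: String!) {
  search(input: {type: DATASET, query: $q, start: 0, count: 20}) {
    searchResults {
      entity { urn }
    }
  }
}
"""


def wildcard_search(frontend_url: str, pattern: str, token: str):
    resp = requests.post(
        f"{frontend_url}/api/graphql",
        json={"query": SEARCH_QUERY, "variables": {"q": pattern}},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [r["entity"]["urn"] for r in resp.json()["data"]["search"]["searchResults"]]


if __name__ == "__main__":
    # e.g. every dataset whose name ends in "csam"
    for urn in wildcard_search("http://localhost:9002", "*csam", token="<personal access token>"):
        print(urn)
```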