Consent-based Conversion Adjustments
Problem statement
Given regulatory requirements, customers have the choice to accept or decline
third-party cookies. For those who opt out of third-party cookie tracking
(hereafter, non-consenting customers), data on their conversions on an
advertiser's website cannot be shared with Smart Bidding. This potential data
loss can lead to worse bidding performance, or drifts in the bidding behaviour
away from the advertiser's initial goals.
We have developed a solution that allows advertisers to capitalise on their
first-party data in order to statistically up-weight conversion values of
customers that gave consent. By doing this, advertisers can feed up to 100% of
the factual conversion values back into Smart Bidding.
Solution description
We take the following approach: For all consenting and non-consenting customers
that converted on a given day, the advertiser has access to first-party data
that describes the customers. Examples could be the adgroup title that a
conversion is attributed to, the device type used, or demographic information.
Based on this information, a feature space can be created that describes each
consenting and non-consenting customer. Importantly, this feature space has to
be the same for all customers.
Given this feature space, we can create a distance-graph for all consenting
customers in our dataset, and find the nearest consenting customers for each
non-consenting customer. This is done using a NearestNeighbors model. The
non-consenting customer's conversion value can then be split across all
identified nearest consenting customers, in proportion to the similarity between
the non-consenting and the consenting customers.
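To make the matching step concrete, below is a minimal sketch of this nearest-neighbor splitting using scikit-learn. The feature matrices, the choice of five neighbors, and the inverse-distance weighting are illustrative assumptions rather than the pipeline's actual defaults.

```python
# Minimal sketch of the neighbor-matching idea with scikit-learn.
# Feature matrices, k=5, and inverse-distance weighting are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Consenting and non-consenting customers embedded in the same feature space,
# e.g. dummy-coded adgroup and device features (random data for illustration).
consent_features = np.random.rand(100, 8)
noconsent_features = np.random.rand(20, 8)
noconsent_values = np.random.rand(20) * 50.0  # conversion values to redistribute

model = NearestNeighbors(n_neighbors=5, metric="manhattan").fit(consent_features)
distances, indices = model.kneighbors(noconsent_features)

# Split each non-consenting conversion value across its nearest consenting
# neighbors, weighting closer (more similar) neighbors more heavily.
adjustments = np.zeros(len(consent_features))
for value, dists, idx in zip(noconsent_values, distances, indices):
    weights = 1.0 / (dists + 1e-9)  # similarity as inverse distance (assumption)
    weights /= weights.sum()
    adjustments[idx] += value * weights

# `adjustments` holds the extra conversion value assigned to each consenting
# customer on top of their own observed conversion value.
```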
Model Parameters
- Distance metric: We need to define the distance metric to use when
  determining the nearest consenting customers. By default, this is set to
  Manhattan distance.
- Radius, number of nearest neighbors, or percentile: In coordination with the
  advertiser and depending on the dataset as well as business requirements,
  the user can choose between the following (see the sketch after this list):
  - setting a fixed radius within which all nearest neighbors should be
    selected,
  - setting a fixed number of nearest neighbors that should be selected for
    each non-consenting customer, independent of their distance, or
  - finding the required radius to ensure that at least x% of non-consenting
    customers would have at least one sufficiently close neighbor.
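As an illustration of these three modes, the following sketch shows how each could be expressed with scikit-learn's NearestNeighbors; the radius, neighbor count, and percentile values are placeholders, not the pipeline's defaults.

```python
# Illustrative sketch of the three neighbor-selection modes; all parameter
# values are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

consent_features = np.random.rand(100, 8)   # consenting customers
noconsent_features = np.random.rand(20, 8)  # non-consenting customers

model = NearestNeighbors(metric="manhattan").fit(consent_features)

# 1) Fixed radius: select every consenting neighbor within the radius.
radius_dists, radius_idx = model.radius_neighbors(noconsent_features, radius=2.0)

# 2) Fixed number of neighbors, independent of their distance.
knn_dists, knn_idx = model.kneighbors(noconsent_features, n_neighbors=10)

# 3) Percentile: find the radius such that at least x% (here 90%) of
#    non-consenting customers have at least one neighbor within it.
nearest_dist, _ = model.kneighbors(noconsent_features, n_neighbors=1)
radius_90 = np.percentile(nearest_dist.ravel(), 90)
pct_dists, pct_idx = model.radius_neighbors(noconsent_features, radius=radius_90)
```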
Data requirements
As mentioned above, consenting and non-consenting customers must lie in the same
feature space. This is currently achieved by considering the adgroup a given
customer has clicked on and splitting it according to the advertiser's logic.
This way, customers that came through similar adgroups are considered to be more
similar to each other. All customers to be considered must have a valid
conversion value larger than zero and must not have missing data.
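As a hypothetical illustration of such a feature space, the sketch below splits an adgroup name into components and dummy-codes them with pandas. The column names and the '_' naming convention are assumptions about an advertiser's setup, not requirements of this solution.

```python
# Hypothetical illustration: deriving a shared feature space from adgroup names.
# Column names and the '_' delimiter are assumptions about the advertiser's
# naming logic.
import pandas as pd

df = pd.DataFrame({
    "GCLID": ["abc", "def", "ghi"],
    "ADGROUP_NAME": ["shoes_mobile_de", "shoes_desktop_fr", "bags_mobile_de"],
    "CONVERSION_VALUE": [12.5, 40.0, 7.3],
})

# Customers with missing data or a non-positive conversion value are excluded.
df = df.dropna()
df = df[df["CONVERSION_VALUE"] > 0]

# Split the adgroup name into its components according to the naming logic.
df[["product", "device", "market"]] = df["ADGROUP_NAME"].str.split("_", expand=True)

# Dummy-code the categorical components so that consenting and non-consenting
# customers lie in the same numeric feature space; identifier columns such as
# GCLID stay out of the feature matrix.
features = pd.get_dummies(df[["product", "device", "market"]])
```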
How to use the solution
This solution uses an Apache Beam pipeline to find the nearest consenting
customers for each non-consenting customer. The following instructions show how
to run the pipeline on Google Cloud Dataflow; however, any other suitable Apache
Beam runner may be used as well.
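As a brief illustration of the runner choice, the sketch below shows how the same Beam pipeline code can be pointed at the local DirectRunner or at Dataflow purely through pipeline options; the project, region, and bucket values are placeholders, and the actual transforms live in this repository.

```python
# Sketch: the same Beam pipeline can target different runners via options.
# Project, region, and bucket values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local test run:
local_options = PipelineOptions(runner="DirectRunner")

# Dataflow run, as used in the instructions below:
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    region="europe-west1",
    temp_location="gs://your-bucket/tmp",
)

with beam.Pipeline(options=local_options) as pipeline:
    # The actual transforms (reading the BigQuery tables, nearest-neighbor
    # matching, writing the OCI CSV) are defined in this repository.
    pass
```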
Installation
Note: This solution requires 3.6 <= Python < 3.9 as Beam does not currently
support Python 3.9.
Set up Dataflow Template
- Navigate to your Google Cloud Project and activate the Cloud Shell.
- Set the current project by running:
  gcloud config set project [YOUR_PROJECT_ID]
- Clone this repository and cd into the project directory.
- Download pyenv as described here.
- Create and activate a virtual environment as follows:
  pyenv install 3.8.12
  pyenv virtualenv 3.8.12 env
  pyenv activate env
- Install python3 dependencies:
  pip3 install -r requirements.txt
- Create a GCS bucket (if one does not already exist) where the Dataflow
  template as well as all inputs and outputs will be stored.
- Set an environment variable with the name of the bucket:
  export PIPELINE_BUCKET=[YOUR_CLOUD_STORAGE_BUCKET_NAME]
- To read data from BigQuery, we need to know the project containing your
  BigQuery tables. Set an environment variable:
  export BQ_PROJECT_ID=[YOUR_BIGQUERY_PROJECT_ID]
- Additionally, set the location of your BigQuery tables, e.g. 'EU' for Europe:
  export BIGQUERY_LOCATION=[YOUR_BIGQUERY_LOCATION]
- Set an environment variable with the name of your BigQuery table with
  consenting user data:
  export TABLE_CONSENT=[CONSENTING_USER_TABLE_NAME]
- Set an environment variable with the name of your BigQuery table with
  non-consenting user data:
  export TABLE_NOCONSENT=[NON_CONSENTING_USER_TABLE_NAME]
- Set an environment variable with the name of the date column in your tables:
  export DATE_COLUMN=[DATE_COLUMN_NAME]
- To up-weight the conversion values in our dataset, we need to know which
  column represents the conversion values in the input data. Set an
  environment variable with the name of the conversion column:
  export CONVERSION_COLUMN=[YOUR_CONVERSION_COLUMN_NAME]
- The final output of the pipeline is a CSV file that may be used for Offline
  Conversion Import (OCI) into Google Ads or Google Marketing Platform (GMP).
  Each row of this OCI CSV must be unique. Set an environment variable with
  the list of columns in the input data that together form a unique ID:
  export ID_COLUMNS=[COMMA_SEPARATED_ID_COLUMNS_LIST]
  e.g. export ID_COLUMNS=GCLID,TIMESTAMP,ADGROUP_NAME (no spaces around the commas!)
- You may want to exclude some columns in your data from being used for
  matching. Set an environment variable with the list of columns in the input
  data that should be dropped, e.g.:
  export DROP_COLUMNS=feature2,feature5
- Provide all categorical columns in your data that should not be dummy-coded,
  e.g.:
  export non_dummy_columns=GCLID,TIMESTAMP
- Set an environment variable with the project id:
  export PROJECT_ID=[YOUR_PROJECT_ID]
- Set an environment variable with the regional endpoint to deploy your
  Dataflow job:
  export PIPELINE_REGION=[YOUR_REGIONAL_ENDPOINT]
- Generate the template by running:
  ./generate_template.sh
- Deactivate the virtual env by typing pyenv deactivate and close Cloud Shell.
Set up Cloud Function
The Apache Beam pipeline that we set up above will be triggered by a Cloud
Function. The following instructions show how to set up the Cloud Function:
- Open Cloud Functions from the navigation menu in your Google Cloud Project.
- If not done already, enable the Cloud Functions and Cloud Build APIs.
- Select Create function and fill in the required fields such as Function name
  and Region. Choose Cloud Pub/Sub as a trigger and create a new topic. We will
  later write to this topic whenever the BigQuery tables have new data, thereby
  triggering the Cloud Function.
- Under runtime settings, set the timeout to at least 60 seconds to give ample
  time for the Cloud Function to run. Click next.
- Upload the contents of the cloud_function directory found in the repo to
  Cloud Functions.
- Select Python 3.8 as Runtime and set Entry point to "run".
- Update the required values in main.py as marked by TODO(): ...
- Deploy the Cloud Function.
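For orientation, a minimal sketch of what such a Pub/Sub-triggered function can look like is shown below. The actual implementation ships in cloud_function/main.py; the project, region, template path, and job name here are placeholders.

```python
# Minimal sketch of a Pub/Sub-triggered Cloud Function that launches the
# Dataflow template; the actual logic lives in cloud_function/main.py.
# Project, region, template path, and job name are placeholders.
from googleapiclient.discovery import build

PROJECT_ID = "your-project-id"
PIPELINE_REGION = "europe-west1"
TEMPLATE_PATH = "gs://your-pipeline-bucket/templates/your-template"


def run(event, context):
    """Entry point: triggered by the Pub/Sub topic wired up above."""
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT_ID,
        location=PIPELINE_REGION,
        gcsPath=TEMPLATE_PATH,
        body={"jobName": "consent-based-conversion-adjustments"},
    )
    response = request.execute()
    print(f"Launched Dataflow job: {response}")
```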
Set up Cloud Logging to Pub/Sub sink
Note: For this section, we assume that you wish to trigger the Dataflow
pipeline whenever new data is inserted in the non-consented or consented
tables. If you have a different requirement, proceed accordingly with setting
up a trigger for the Cloud Function. See also:
Using Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
- In Cloud Logging on your Google Cloud Project, filter to the relevant
  BigQuery event. For example, to filter by table inserts, use:
  protoPayload.serviceName="bigquery.googleapis.com"
  protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob"
  protoPayload.resourceName="projects/[YOUR_PROJECT_ID]/datasets/[YOUR_DATASET]/tables/[YOUR_TABLE_NAME]"
  resource.labels.project_id="[YOUR_PROJECT_ID]"
  protoPayload.metadata.tableDataChange.reason="QUERY"
  Once the relevant event is available, create a sink that routes your logs to
  the Pub/Sub topic defined above. For more information on creating sinks, see
  the documentation.
- With this in place, the Dataflow pipeline should get triggered whenever new
  data is inserted in your BigQuery tables.
Contributing
See CONTRIBUTING.md for details.
License
Apache 2.0; see LICENSE for details.
Disclaimer
This is not an official Google product.