feat: more user_migration stuff (#450)

* create script to move users by node directly
* moved old scripts to `old` directory (for historic reasons, as well as possible future use)
* cleaned up README
* try to solve the `parent row` error
  an intermittent error may result from one of two things:
  1) a transaction failure resulted in a premature add of the unique key to the UC filter.
  2) an internal spanner update error resulting from trying to write the bso before the user_collection row was written.
* Added "fix_collections.sql" script to update collections table to add well known collections for future rectification.
* returned collection name lookup
* add "--user" arg to set bso and user id
* add `--dryrun` mode
2025-08-07 12:26:57 +02:00 · 2020-03-02 20:26:07 -08:00 · 2020-03-02 20:26:07 -08:00 · ecfca9fdf5
commit ecfca9fdf5
parent 13b44f488d
10 changed files with 1069 additions and 210 deletions
--- a/tools/user_migration/README.md
+++ b/tools/user_migration/README.md
@@ -1,16 +1,16 @@
-# User Migration Script
+ # User Migration Script

 This is a workspace for testing user migration from the old databases to the new durable one.

-Avro is a JSON like transport system. I'm not quite sure it's really needed. Mostly since
-it's basically "read a row/write a row" only with a middleman system. I wonder if it might
-be better to just iterate through mysql on a thread and write to spanner directly.
-
 There are several candidate scripts that you can use. More than likely, you want to use

-`migrate_user.py --dsns <file of database DSNs> --users <file of user ids>
-    [--token_dsn <tokenserver DSN>]`
-
+```bash
+GOOGLE_APPLICATION_CREDENTIALS=credentials.json migrate_node.py \
+    [--dsns=move_dsns.lst] \
+    [--deanon --fxa_file=users.csv] \
+    [--start_bso=0] \
+    [--end_bso=19]
+```
 where:

 * *dsns* - a file containing the mysql and spanner DSNs for the users. Each DSN should be on a single line. Currently only one DSN of a given type is permitted.
@@ -22,26 +22,52 @@ mysql://test:test@localhost/syncstorage
 spanner://projects/sync-spanner-dev-225401/instances/spanner-test/databases/sync_schema3
 ```

-* *users* - A file containing the list of mysql userIDs to move. Each should be on a new line.
-
-(e.g.)
-
-```text
-1
-3
-1298127571
+* *users.csv* - a mysql dump of the token database. This file is only needed if the `--deanon` de-anonymization flag is set. By default, data is anonymized to prevent accidental movement.
+You can produce this file from the following:
+```bash
+mysql -e "select uid, email, generation, keys_changed_at, \
+ client_state from users;" > users.csv
 ```
+The script will automatically skip the title row, and presumes that fields are tab separated.

-* *token_dsn* - An optional DSN to the Token Server DB. The script will automatically update the `users` table to indicate the user is now on the spanner node (`800`). If no *token_dsn* option is provided, the token_db is not altered, manual updates may be required.
-
-The other scripts (e.g. `dump_mysql.py`) dump the node database into Arvo compatible format. Additional work may be required to format the data for eventual import. Note that these scripts may require python 2.7 due to dependencies in the Arvo library.
+UserIDs are converted to fxa_uid/fxa_kid values and cached locally.

 ## installation

-`virtualenv venv && venv/bin/pip install -r requirements.txt`
+```bash
+virtualenv venv && venv/bin/pip install -r requirements.txt
+```

 ## running

-Since you will be connecting to the GCP Spanner API, you will need to have set the `GOOGLE_APPLICATION_CREDENTIALS` env var before running these scripts.
+Since you will be connecting to the GCP Spanner API, you will need to have set the `GOOGLE_APPLICATION_CREDENTIALS` env var before running these scripts. This environment variable should point to the exported Google Credentials acquired from the GCP console.

-`GOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json venv/bin/python dump_avro.py`
+The script will take the following actions:
+
+1. fetch all users from a given node.
+1. compare and port all user_collections over (NOTE: this may involve remapping collectionid values.)
+1. begin copying over user information from mysql to spanner
+
+Overall performance may be improved by "batching" BSOs to different
+processes using:
+
+`--start_bso` the BSO database (defaults to 0, inclusive) to begin
+copying from
+
+`--end_bso` the final BSO database (defaults to 19, inclusive) to copy
+from
+
+Note that these are inclusive values. So to split between two
+processes, you would want to use
+
+```bash
+migrate_node.py --start_bso=0 --end_bso=9 &
+migrate_node.py --start_bso=10 --end_bso=19 &
+```
+
+(As short hand for this case, you could also do:
+```
+migrate_node.py --end_bso=9 &
+migrate_node.py --start_bso=10 &
+```
+and let the defaults handle the rest.)
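
The inclusive `--start_bso` / `--end_bso` split described in the README can be worked out for any number of processes. The following is a minimal sketch only: `split_bso_ranges` is a hypothetical helper that is not part of this commit, and it assumes the BSO databases are numbered contiguously.

```python
def split_bso_ranges(start_bso=0, end_bso=19, workers=2):
    """Split the inclusive range [start_bso, end_bso] into at most
    `workers` contiguous, inclusive (start, end) pairs.

    Hypothetical helper for illustration; not part of migrate_node.py.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if end_bso < start_bso:
        raise ValueError("end_bso must not be less than start_bso")
    total = end_bso - start_bso + 1
    # Never create more ranges than there are BSO databases.
    workers = min(workers, total)
    base, extra = divmod(total, workers)
    ranges = []
    low = start_bso
    for i in range(workers):
        # The first `extra` workers take one additional database each.
        size = base + (1 if i < extra else 0)
        ranges.append((low, low + size - 1))
        low += size
    return ranges


if __name__ == "__main__":
    # Print one command line per worker, mirroring the README example.
    for low, high in split_bso_ranges(0, 19, 2):
        print("migrate_node.py --start_bso={} --end_bso={} &".format(
            low, high))
```

Because both ends are inclusive, each range starts one past the end of the previous range, so no BSO database is copied twice or skipped.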
--- a/tools/user_migration/dump_mysql.py
+++ b/tools/user_migration/dump_mysql.py
@@ -1,173 +0,0 @@
-#! venv/bin/python
-
-# painfully stupid script to check out dumping mysql databases to avro.
-# Avro is basically "JSON" for databases. It's not super complicated & it has
-# issues (one of which is that it requires Python2).
-#
-#
-
-import avro.schema
-import argparse
-import base64
-import time
-
-from avro.datafile import DataFileWriter
-from avro.io import DatumWriter
-from mysql import connector
-from urlparse import urlparse
-
-
-class BadDSNException(Exception):
-    pass
-
-
-def get_args():
-    parser = argparse.ArgumentParser(description="dump spanner to arvo files")
-    parser.add_argument(
-        '--dsns', default="dsns.lst",
-        help="file of new line separated DSNs")
-    parser.add_argument(
-        '--schema', default="sync.avsc",
-        help="Database schema description")
-    parser.add_argument(
-        '--output', default="output.avso",
-        help="Output file")
-    parser.add_argument(
-        '--limit', type=int, default=1500000,
-        help="Limit each read chunk to n rows")
-    return parser.parse_args()
-
-
-def conf_db(dsn):
-    dsn = urlparse(dsn)
-    if dsn.scheme != "mysql":
-        raise BadDSNException("Invalid MySQL dsn: {}".format(dsn))
-    connection = connector.connect(
-        user=dsn.username,
-        password=dsn.password,
-        host=dsn.hostname,
-        port=dsn.port or 3306,
-        database=dsn.path[1:]
-    )
-    return connection
-
-
-# The following two functions are taken from browserid.utils
-def encode_bytes_b64(value):
-    return base64.urlsafe_b64encode(value).rstrip(b'=').decode('ascii')
-
-
-def format_key_id(keys_changed_at, key_hash):
-    return "{:013d}-{}".format(
-        keys_changed_at,
-        encode_bytes_b64(key_hash),
-    )
-
-
-def get_fxa_id(database, user):
-    sql = """
-        SELECT
-            email, generation, keys_changed_at, client_state
-        FROM users
-            WHERE uid = {uid}
-    """.format(uid=user)
-    cursor = database.cursor()
-    cursor.execute(sql)
-    (email, generation, keys_changed_at, client_state) = cursor.next()
-    fxa_uid = email.split('@')[0]
-    fxa_kid = format_key_id(
-        keys_changed_at or generation,
-        bytes.fromhex(client_state),
-    )
-    cursor.close()
-    return (fxa_kid, fxa_uid)
-
-
-def dump_rows(offset, db, writer, args):
-    # bso column mapping:
-    # id => bso_id
-    # collection => collection_id
-    # sortindex => sortindex
-    # modified => modified
-    # payload => payload
-    # payload_size => NONE
-    # ttl => expiry
-
-    print("Querying.... @{}".format(offset))
-    sql = """
-    SELECT userid, collection, id,
-    ttl, modified, payload,
-    sortindex from bso LIMIT {} OFFSET {}""".format(
-        args.limit, offset)
-    cursor = db.cursor()
-    user = None
-    try:
-        cursor.execute(sql)
-        print("Dumping...")
-        for (userid, cid, bid, exp, mod, pay, si) in cursor:
-            if userid != user:
-                (fxa_kid, fxa_uid) = get_fxa_id(db, userid)
-                user = userid
-            writer.append({
-                "collection_id": cid,
-                "fxa_kid": fxa_kid,
-                "fxa_uid": fxa_uid,
-                "bso_id": bid,
-                "expiry": exp,
-                "modified": mod,
-                "payload": pay,
-                "sortindex": si})
-            offset += 1
-            if offset % 1000 == 0:
-                print("Row: {}".format(offset))
-        return offset
-    except Exception as e:
-        print("Deadline hit at: {} ({})".format(offset, e))
-        return offset
-    finally:
-        cursor.close()
-
-
-def count_rows(db):
-    cursor = db.cursor()
-    try:
-        cursor.execute("SELECT Count(*) from bso")
-        return cursor.fetchone()[0]
-    finally:
-        cursor.close()
-
-
-def dump_data(args, schema, dsn):
-    offset = 0
-    # things time out around 1_500_000 rows.
-    # yes, this dumps from spanner for now, I needed a big db to query
-    db = conf_db(dsn)
-    writer = DataFileWriter(
-        open(args.output, "wb"), DatumWriter(), schema)
-    row_count = count_rows(db)
-    print("Dumping {} rows".format(row_count))
-    while offset < row_count:
-        old_offset = offset
-        offset = dump_rows(offset=offset, db=db, writer=writer, args=args)
-        if offset == old_offset:
-            break
-    writer.close()
-    return row_count
-
-
-def main():
-    start = time.time()
-    args = get_args()
-    dsns = open(args.dsns).readlines()
-    schema = avro.schema.parse(open(args.schema, "rb").read())
-    for dsn in dsns:
-        print("Starting: {}".format(dsn))
-        try:
-            rows = dump_data(args, schema, dsn)
-        except Exception as ex:
-            print("Could not process {}: {}".format(dsn, ex))
-    print("Dumped: {} rows in {} seconds".format(rows, time.time() - start))
-
-
-if __name__ == "__main__":
-    main()
--- a/tools/user_migration/fix_collections.sql
+++ b/tools/user_migration/fix_collections.sql
@@ -0,0 +1,15 @@
+INSERT IGNORE INTO collections (name, collectionid) VALUES
+        ("clients", 1),
+        ("crypto", 2),
+        ("forms", 3),
+        ("history", 4),
+        ("keys", 5),
+        ("meta", 6),
+        ("bookmarks", 7),
+        ("prefs", 8),
+        ("tabs", 9),
+        ("passwords", 10),
+        ("addons", 11),
+        ("addresses", 12),
+        ("creditcards", 13),
+        ("reserved", 99);
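
The ids inserted by `fix_collections.sql` are presumably meant to line up with the defaults in `Collections._by_name` in `migrate_node.py` from this same commit. A quick consistency check could look like the sketch below. The two dictionaries are copied by hand from this diff, and `find_mismatches` is a hypothetical helper, not something the commit provides.

```python
# Values copied by hand from fix_collections.sql in this commit.
SQL_COLLECTIONS = {
    "clients": 1, "crypto": 2, "forms": 3, "history": 4, "keys": 5,
    "meta": 6, "bookmarks": 7, "prefs": 8, "tabs": 9, "passwords": 10,
    "addons": 11, "addresses": 12, "creditcards": 13, "reserved": 99,
}

# Values copied by hand from Collections._by_name in migrate_node.py.
SCRIPT_COLLECTIONS = {
    "clients": 1, "crypto": 2, "forms": 3, "history": 4, "keys": 5,
    "meta": 6, "bookmarks": 7, "prefs": 8, "tabs": 9, "passwords": 10,
    "addons": 11, "addresses": 12, "creditcards": 13, "reserved": 100,
}


def find_mismatches(left, right):
    """Return {name: (left_id, right_id)} for every collection name whose
    id differs between the two mappings or is missing from one of them."""
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


if __name__ == "__main__":
    for name, (sql_id, script_id) in find_mismatches(
            SQL_COLLECTIONS, SCRIPT_COLLECTIONS).items():
        print("{}: sql={} script={}".format(name, sql_id, script_id))
```

As copied here, the only entry that differs is `reserved` (99 in the SQL file, 100 in the script). Which value is intended is not stated in this commit.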
--- a/tools/user_migration/migrate_node.py
+++ b/tools/user_migration/migrate_node.py
@@ -0,0 +1,610 @@
+#! venv/bin/python
+
+# Script to move users from a MySQL syncstorage node directly to Spanner.
+# Unlike the older dump scripts (now in `old/`), it does not go through
+# an intermediate Avro file.
+#
+#
+
+import argparse
+import logging
+import base64
+import binascii
+import csv
+import sys
+import math
+import json
+import os
+import time
+from datetime import datetime
+
+from mysql import connector
+from google.cloud import spanner
+from google.api_core.exceptions import AlreadyExists, InvalidArgument
+try:
+    from urllib.parse import urlparse
+except ImportError:
+    from urlparse import urlparse
+
+META_GLOBAL_COLLECTION_ID = 6
+MAX_ROWS = 1500000
+
+
+class BadDSNException(Exception):
+    pass
+
+
+class FXA_info:
+    """User information from Tokenserver database.
+
+    Can be constructed from
+    ``mysql -e "select uid, email, generation, keys_changed_at, \
+       client_state from users;" > users.csv``
+    """
+    users = {}
+    anon = False
+
+    def __init__(self, fxa_csv_file, args):
+        if args.anon:
+            self.anon = True
+            return
+        logging.debug("Processing token file...")
+        if not os.path.isfile(fxa_csv_file):
+            raise IOError("{} not found".format(fxa_csv_file))
+        with open(fxa_csv_file) as csv_file:
+            try:
+                line = 0
+                for (uid, email, generation,
+                     keys_changed_at, client_state) in csv.reader(
+                        csv_file, delimiter="\t"):
+                    line += 1
+                    if uid == 'uid':
+                        # skip the header row.
+                        continue
+                    try:
+                        fxa_uid = email.split('@')[0]
+                        fxa_kid = self.format_key_id(
+                            int(keys_changed_at or generation),
+                            binascii.unhexlify(client_state))
+                        logging.debug("Adding user {} => {} , {}".format(
+                            uid, fxa_kid, fxa_uid
+                        ))
+                        self.users[int(uid)] = (fxa_kid, fxa_uid)
+                    except Exception as ex:
+                        logging.error("Skipping user {}: {}".format(uid, ex))
+            except Exception as ex:
+                logging.critical("Error in fxa file around line {}: {}".format(
+                    line, ex))
+
+    # The following two functions are taken from browserid.utils
+    def encode_bytes_b64(self, value):
+        return base64.urlsafe_b64encode(value).rstrip(b'=').decode('ascii')
+
+    def format_key_id(self, keys_changed_at, key_hash):
+        return "{:013d}-{}".format(
+            keys_changed_at,
+            self.encode_bytes_b64(key_hash),
+        )
+
+    def get(self, userid):
+        if userid in self.users:
+            return self.users[userid]
+        if self.anon:
+            fxa_uid = "fake_" + binascii.hexlify(
+                os.urandom(11)).decode('utf-8')
+            fxa_kid = "fake_" + binascii.hexlify(
+                os.urandom(11)).decode('utf-8')
+            self.users[userid] = (fxa_kid, fxa_uid)
+            return (fxa_kid, fxa_uid)
+
+
+class Collections:
+    """Cache spanner collection list.
+
+    The spanner collection list is the (soon to be) single source of
+    truth regarding collection ids.
+
+    """
+    _by_name = {
+        "clients": 1,
+        "crypto": 2,
+        "forms": 3,
+        "history": 4,
+        "keys": 5,
+        "meta": 6,
+        "bookmarks": 7,
+        "prefs": 8,
+        "tabs": 9,
+        "passwords": 10,
+        "addons": 11,
+        "addresses": 12,
+        "creditcards": 13,
+        "reserved": 100,
+    }
+    spanner = None
+
+    def __init__(self, databases):
+        """merge the mysql user_collections into spanner"""
+        sql = """
+        SELECT
+            DISTINCT uc.collection, cc.name
+        FROM
+            user_collections as uc,
+            collections as cc
+        WHERE
+            uc.collection = cc.collectionid
+        ORDER BY
+            uc.collection
+        """
+        cursor = databases['mysql'].cursor()
+
+        def transact(transaction, values):
+            transaction.insert(
+                'collections',
+                columns=('collection_id', 'name'),
+                values=values)
+
+        self.spanner = databases['spanner']
+        try:
+            # fetch existing:
+            with self.spanner.snapshot() as scursor:
+                rows = scursor.execute_sql(
+                    "select collection_id, name from collections")
+                for (collection_id, name) in rows:
+                    logging.debug("Loading collection: {} => {}".format(
+                        name, collection_id
+                    ))
+                    self._by_name[name] = collection_id
+            cursor.execute(sql)
+            for (collection_id, name) in cursor:
+                if name not in self._by_name:
+                    logging.debug("Adding collection: {} => {}".format(
+                        name, collection_id
+                    ))
+                    values = [(collection_id, name)]
+                    self._by_name[name] = collection_id
+                    # Since a collection may collide, do these one at a time.
+                    try:
+                        self.spanner.run_in_transaction(transact, values)
+                    except AlreadyExists:
+                        logging.info(
+                            "Skipping already present collection {}".format(
+                                values
+                            ))
+                        pass
+        finally:
+            cursor.close()
+
+    def get(self, name, collection_id=None):
+        """Fetches the collection_id"""
+
+        id = self._by_name.get(name)
+        if id is None:
+            logging.warning(
+                "Unknown collection {}:{} encountered!".format(
+                    name, collection_id))
+            # it would be swell to add these to the collection table,
+            # but that would mean
+            # an embedded spanner transaction, and that's not allowed.
+            return None
+        return id
+
+
+def conf_mysql(dsn):
+    """create a connection to the original storage system """
+    logging.debug("Configuring MYSQL: {}".format(dsn))
+    connection = connector.connect(
+        user=dsn.username,
+        password=dsn.password,
+        host=dsn.hostname,
+        port=dsn.port or 3306,
+        database=dsn.path[1:]
+    )
+    return connection
+
+
+def conf_spanner(dsn):
+    """create a connection to the new Spanner system"""
+    logging.debug("Configuring SPANNER: {}".format(dsn))
+    path = dsn.path.split("/")
+    instance_id = path[-3]
+    database_id = path[-1]
+    client = spanner.Client()
+    instance = client.instance(instance_id)
+    database = instance.database(database_id)
+    return database
+
+
+def conf_db(dsn):
+    """read the list of storage definitions from the file and create
+    a set of connections.
+
+     """
+    if "mysql" in dsn.scheme:
+        return conf_mysql(dsn)
+    if dsn.scheme == "spanner":
+        return conf_spanner(dsn)
+    raise RuntimeError("Unknown DSN type: {}".format(dsn.scheme))
+
+
+def dumper(columns, values):
+    """verbose column and data dumper. """
+    result = ""
+    for row in values:
+        for i in range(0, len(columns)):
+            result += " {} => {}\n".format(columns[i], row[i])
+    return result
+
+
+def newSyncID():
+    return base64.urlsafe_b64encode(os.urandom(9)).decode('ascii')
+
+
+def alter_syncids(pay):
+    """Alter the syncIDs for the meta/global record, which will cause a sync
+    when the client reconnects
+
+
+    """
+    payload = json.loads(pay)
+    payload['syncID'] = newSyncID()
+    for item in payload['engines']:
+        payload['engines'][item]['syncID'] = newSyncID()
+    return json.dumps(payload)
+
+
+def divvy(biglist, count):
+    lists = []
+    biglen = len(biglist)
+    start = 0
+    while start < biglen:
+        lists.append(biglist[start:min(start+count, biglen)])
+        start += count
+    return lists
+
+
+def move_user(databases, user, collections, fxa, bso_num, args):
+    """copy user info from original storage to new storage."""
+    # bso column mapping:
+    # id => bso_id
+    # collection => collection_id
+    # sortindex => sortindex
+    # modified => modified
+    # payload => payload
+    # payload_size => NONE
+    # ttl => expiry
+
+    uc_columns = (
+        'fxa_kid',
+        'fxa_uid',
+        'collection_id',
+        'modified',
+    )
+    bso_columns = (
+            'collection_id',
+            'fxa_kid',
+            'fxa_uid',
+            'bso_id',
+            'expiry',
+            'modified',
+            'payload',
+            'sortindex',
+    )
+
+    # Generate the Spanner Keys we'll need.
+    try:
+        (fxa_kid, fxa_uid) = fxa.get(user)
+    except TypeError:
+        logging.error("User not found: {} ".format(
+            user
+        ))
+        return 0
+    except Exception as ex:
+        logging.error(
+            "Could not move user: {}".format(user),
+            exc_info=ex
+        )
+        return 0
+
+    # Fetch the BSO data from the original storage.
+    sql = """
+    SELECT
+        collections.name, bso.collection,
+        bso.id, bso.ttl, bso.modified, bso.payload, bso.sortindex
+    FROM
+        bso{} as bso,
+        collections
+    WHERE
+        bso.userid = %s
+            and collections.collectionid = bso.collection
+            and bso.ttl > unix_timestamp()
+    ORDER BY
+        modified DESC""".format(bso_num)
+
+    def spanner_transact_uc(
+            transaction, data, fxa_kid, fxa_uid, args):
+        # user collections require a unique key.
+        unique_key_filter = set()
+        for (col, cid, bid, exp, mod, pay, sid) in data:
+            collection_id = collections.get(col, cid)
+            if collection_id is None:
+                continue
+            # columns from sync_schema3
+            mod_v = datetime.utcfromtimestamp(mod/1000.0)
+            # User_Collection can only have unique values. Filter
+            # non-unique keys and take the most recent modified
+            # time. The join could be anything.
+            uc_key = "{}_{}_{}".format(fxa_uid, fxa_kid, col)
+            if uc_key not in unique_key_filter:
+                uc_values = [(
+                    fxa_kid,
+                    fxa_uid,
+                    collection_id,
+                    mod_v,
+                )]
+                if not args.dryrun:
+                    transaction.replace(
+                        'user_collections',
+                        columns=uc_columns,
+                        values=uc_values
+                    )
+                else:
+                    logging.debug("not writing {} => {}".format(uc_columns, uc_values))
+                unique_key_filter.add(uc_key)
+
+    def spanner_transact_bso(transaction, data, fxa_kid, fxa_uid, args):
+        count = 0
+        for (col, cid, bid, exp, mod, pay, sid) in data:
+            collection_id = collections.get(col, cid)
+            if collection_id is None:
+                continue
+            if collection_id != cid:
+                logging.debug(
+                    "Remapping collection '{}' from {} to {}".format(
+                        col, cid, collection_id))
+            # columns from sync_schema3
+            mod_v = datetime.utcfromtimestamp(mod/1000.0)
+            exp_v = datetime.utcfromtimestamp(exp)
+
+            # add the BSO values.
+            if args.full and collection_id == META_GLOBAL_COLLECTION_ID:
+                pay = alter_syncids(pay)
+            bso_values = [[
+                    collection_id,
+                    fxa_kid,
+                    fxa_uid,
+                    bid,
+                    exp_v,
+                    mod_v,
+                    pay,
+                    sid,
+            ]]
+
+            if not args.dryrun:
+                logging.debug(
+                    "###bso{} {}".format(
+                        bso_num,
+                        dumper(bso_columns, bso_values)
+                    )
+                )
+                transaction.insert(
+                    'bsos',
+                    columns=bso_columns,
+                    values=bso_values
+                )
+            else:
+                logging.debug("not writing {} => {}".format(bso_columns, bso_values))
+            count += 1
+        return count
+
+    cursor = databases['mysql'].cursor()
+    count = 0
+    try:
+        # Note: cursor() does not support __enter__()
+        logging.info("Processing... {} -> {}:{}".format(
+            user, fxa_uid, fxa_kid))
+        cursor.execute(sql, (user,))
+        data = []
+        for row in cursor:
+            data.append(row)
+        for bunch in divvy(data, args.readchunk or 1000):
+            # Occasionally, there is a batch fail because a
+            # user collection is not found before a bso is written.
+            # to solve that, divide the UC updates from the
+            # BSO updates.
+            # Run through the list of UserCollection updates
+            databases['spanner'].run_in_transaction(
+                spanner_transact_uc,
+                bunch,
+                fxa_kid,
+                fxa_uid,
+                args,
+            )
+            count += databases['spanner'].run_in_transaction(
+                spanner_transact_bso,
+                bunch,
+                fxa_kid,
+                fxa_uid,
+                args,
+            )
+
+    except AlreadyExists:
+        logging.warning(
+            "User already imported fxa_uid:{} / fxa_kid:{}".format(
+                fxa_uid, fxa_kid
+            ))
+    except InvalidArgument as ex:
+        if "already inserted" in ex.args[0]:
+            logging.warning(
+                "User already imported fxa_uid:{} / fxa_kid:{}".format(
+                    fxa_uid, fxa_kid
+                ))
+        else:
+            raise
+    except Exception as e:
+        logging.error("### batch failure:", exc_info=e)
+    finally:
+        # cursor may complain about unread data, this should prevent
+        # that warning.
+        for result in cursor:
+            pass
+        cursor.close()
+    return count
+
+
+def move_database(databases, collections, bso_num, fxa, args):
+    """iterate over provided users and move their data from old to new"""
+    start = time.time()
+    cursor = databases['mysql'].cursor()
+    # off chance that someone else might have written
+    # a new collection table since the last time we
+    # fetched.
+    rows = 0
+    cursor = databases['mysql'].cursor()
+    users = []
+    if args.user:
+        users = [args.user]
+    else:
+        try:
+            sql = """select distinct userid from bso{};""".format(bso_num)
+            cursor.execute(sql)
+            users = [user for (user,) in cursor]
+        except Exception as ex:
+            logging.error("Error moving database:", exc_info=ex)
+            return rows
+        finally:
+            cursor.close()
+    logging.info("Moving {} users".format(len(users)))
+    for user in users:
+        rows += move_user(
+            databases=databases,
+            user=user,
+            collections=collections,
+            fxa=fxa,
+            bso_num=bso_num,
+            args=args)
+    logging.info("Finished BSO #{} ({} rows) in {} seconds".format(
+        bso_num,
+        rows,
+        math.ceil(time.time() - start)
+    ))
+    return rows
+
+
+def get_args():
+    parser = argparse.ArgumentParser(
+        description="move user from sql to spanner")
+    parser.add_argument(
+        '--dsns', default="move_dsns.lst",
+        help="file of new line separated DSNs")
+    parser.add_argument(
+        '--verbose',
+        action="store_true",
+        help="verbose logging"
+    )
+    parser.add_argument(
+        '--quiet',
+        action="store_true",
+        help="silence logging"
+    )
+    parser.add_argument(
+        '--chunk_limit', type=int, default=1500000,
+        dest='limit',
+        help="Limit each read chunk to n rows")
+    parser.add_argument(
+        '--offset', type=int, default=0,
+        help="UID to start at")
+    parser.add_argument(
+        "--full",
+        action="store_true",
+        help="force a full reconcile"
+    )
+    parser.add_argument(
+        '--deanon', action='store_false',
+        dest='anon',
+        help="Do not anonymize the user data"
+    )
+    parser.add_argument(
+        '--start_bso', default=0,
+        type=int,
+        help="start dumping BSO database"
+    )
+    parser.add_argument(
+        '--end_bso',
+        type=int, default=19,
+        help="last BSO database to dump"
+    )
+    parser.add_argument(
+        '--fxa_file',
+        default="users.csv",
+        help="FXA User info in CSV format"
+    )
+    parser.add_argument(
+        '--skip_collections', action='store_false',
+        help="skip user_collections table"
+    )
+    parser.add_argument(
+        '--readchunk',
+        default=1000, type=int,
+        help="how many rows per transaction for spanner"
+    )
+    parser.add_argument(
+        '--user',
+        type=str,
+        help="BSO#:userId to move (EXPERIMENTAL)."
+    )
+    parser.add_argument(
+        '--dryrun',
+        action="store_true",
+        help="Do not write user records to spanner."
+    )
+
+
+    return parser.parse_args()
+
+
+def main():
+    args = get_args()
+    log_level = logging.INFO
+    if args.quiet:
+        log_level = logging.ERROR
+    if args.verbose:
+        log_level = logging.DEBUG
+    logging.basicConfig(
+        stream=sys.stdout,
+        level=log_level,
+    )
+    dsns = open(args.dsns).readlines()
+    databases = {}
+    rows = 0
+
+    if args.user:
+        (bso, userid) = args.user.split(':')
+        args.start_bso = int(bso)
+        args.end_bso = int(bso)
+        args.user = int(userid)
+    for line in dsns:
+        dsn = urlparse(line.strip())
+        scheme = dsn.scheme
+        if 'mysql' in dsn.scheme:
+            scheme = 'mysql'
+        databases[scheme] = conf_db(dsn)
+    if not databases.get('mysql') or not databases.get('spanner'):
+        raise RuntimeError("Both mysql and spanner dsns must be specified")
+    fxa_info = FXA_info(args.fxa_file, args)
+    collections = Collections(databases)
+    logging.info("Starting:")
+    if args.dryrun:
+        logging.info("=== DRY RUN MODE ===")
+    start = time.time()
+    for bso_num in range(args.start_bso, args.end_bso+1):
+        logging.info("Moving users in bso # {}".format(bso_num))
+        rows += move_database(
+            databases, collections, bso_num, fxa_info, args)
+    logging.info(
+        "Moved: {} rows in {} seconds".format(
+            rows or 0, time.time() - start))
+
+
+if __name__ == "__main__":
+    main()
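
The `fxa_uid` / `fxa_kid` derivation used by `FXA_info` above can be exercised on its own. The sketch below restates the same steps as standalone functions. `fxa_ids_from_row` is a hypothetical name, and the sample row in the usage block is invented for illustration rather than taken from any real token database.

```python
import base64
import binascii


def format_key_id(keys_changed_at, key_hash):
    """Zero-pad the timestamp to 13 digits and append the unpadded,
    URL-safe base64 encoding of the key hash (as in the diff above)."""
    encoded = base64.urlsafe_b64encode(key_hash).rstrip(b"=").decode("ascii")
    return "{:013d}-{}".format(keys_changed_at, encoded)


def fxa_ids_from_row(email, generation, keys_changed_at, client_state):
    """Derive (fxa_kid, fxa_uid) from one row of the users.csv dump.

    Hypothetical helper; all arguments are the raw string fields of a row.
    """
    # The fxa_uid is the local part of the stored email address.
    fxa_uid = email.split("@")[0]
    # Fall back to `generation` when `keys_changed_at` is empty.
    fxa_kid = format_key_id(
        int(keys_changed_at or generation),
        binascii.unhexlify(client_state),
    )
    return fxa_kid, fxa_uid


if __name__ == "__main__":
    # Invented sample values: client_state "616161" is hex for b"aaa".
    print(fxa_ids_from_row("abc123@example.com", "1234", "", "616161"))
```

Keeping the derivation in a small pure function like this makes it straightforward to spot-check a few rows of the dump before starting a full `--deanon` run.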
--- a/tools/user_migration/old/dump_avro.py
+++ b/tools/user_migration/old/dump_avro.py
--- a/tools/user_migration/old/dump_mysql.py
+++ b/tools/user_migration/old/dump_mysql.py
@@ -0,0 +1,312 @@
+#! venv/bin/python
+
+# This file is historical.
+# We're using `migrate_node.py`, however this file may be useful in the future
+# if we determine there's a problem with directly transcribing the data from
+# mysql to spanner.
+#
+# painfully stupid script to check out dumping mysql databases to avro.
+# Avro is basically "JSON" for databases. It's not super complicated & it has
+# issues.
+#
+
+import avro.schema
+import argparse
+import binascii
+import csv
+import base64
+import math
+import time
+import os
+import random
+import re
+
+from avro.datafile import DataFileWriter
+from avro.io import DatumWriter
+from mysql import connector
+try:
+    from urllib.parse import urlparse
+except ImportError:
+    from urlparse import urlparse
+
+
+MAX_ROWS=1500000
+
+class BadDSNException(Exception):
+    pass
+
+
+def get_args():
+    parser = argparse.ArgumentParser(description="dump spanner to arvo files")
+    parser.add_argument(
+        '--dsns', default="dsns.lst",
+        help="file of new line separated DSNs")
+    parser.add_argument(
+        '--schema', default="sync.avsc",
+        help="Database schema description")
+    parser.add_argument(
+        '--col_schema', default="user_collection.avsc",
+        help="User Collection schema description"
+    )
+    parser.add_argument(
+        '--output', default="output.avso",
+        help="Output file")
+    parser.add_argument(
+        '--limit', type=int, default=1500000,
+        help="Limit each read chunk to n rows")
+    parser.add_argument(
+        '--offset', type=int, default=0,
+        help="UID to start at")
+    parser.add_argument(
+        '--deanon', action='store_false',
+        dest='anon',
+        help="Do not anonymize the user data"
+    )
+    parser.add_argument(
+        '--start_bso', default=0,
+        type=int,
+        help="start dumping BSO database"
+    )
+    parser.add_argument(
+        '--end_bso',
+        type=int, default=19,
+        help="last BSO database to dump"
+    )
+    parser.add_argument(
+        '--token_file',
+        default='users.csv',
+        help="token user database dump CSV"
+    )
+    parser.add_argument(
+        '--skip_collections', action='store_true',
+        help="skip user_collections table"
+    )
+
+    return parser.parse_args()
+
+
+def conf_db(dsn):
+    dsn = urlparse(dsn)
+    """
+    if dsn.scheme != "mysql":
+        raise BadDSNException("Invalid MySQL dsn: {}".format(dsn))
+    """
+    connection = connector.connect(
+        user=dsn.username,
+        password=dsn.password,
+        host=dsn.hostname,
+        port=dsn.port or 3306,
+        database=dsn.path[1:]
+    )
+    return connection
+
+
+# The following two functions are taken from browserid.utils
+def encode_bytes_b64(value):
+    return base64.urlsafe_b64encode(value).rstrip(b'=').decode('ascii')
+
+
+def format_key_id(keys_changed_at, key_hash):
+    return "{:013d}-{}".format(
+        keys_changed_at,
+        encode_bytes_b64(key_hash),
+    )
+
+
+user_ids = {}
+
+def read_in_token_file(filename):
+    global user_ids
+    # you can generate the token file using
+    # `mysql -e "select uid, email, generation, keys_changed_at, \
+    #  client_state from users;" > users.csv`
+    #
+    # future opt: write the transmogrified file to either sqlite3
+    # or static files.
+    print("Processing token file...")
+    with open(filename) as csv_file:
+        for (uid, email, generation,
+             keys_changed_at, client_state) in csv.reader(
+                 csv_file, delimiter="\t"):
+            if uid == 'uid':
+                # skip the header row.
+                continue
+            fxa_uid = email.split('@')[0]
+            fxa_kid = "{:013d}-{}".format(
+                int(keys_changed_at or generation),
+                base64.urlsafe_b64encode(
+                    binascii.unhexlify(client_state)
+                    ).rstrip(b'=').decode('ascii'))
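+            # the CSV yields strings, but mysql returns userid as an int,
+            # so key on the int value to make the later lookups match.
+            user_ids[int(uid)] = (fxa_kid, fxa_uid)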
+
+
+def get_fxa_id(user_id, anon=True):
+    global user_ids
+    if user_id in user_ids:
+        return user_ids[user_id]
+    if anon:
+        fxa_uid = binascii.hexlify(
+            os.urandom(16)).decode('utf-8')
+        fxa_kid = binascii.hexlify(
+            os.urandom(16)).decode('utf-8')
+        user_ids[user_id] = (fxa_kid, fxa_uid)
+        return (fxa_kid, fxa_uid)
+
+
+def dump_user_collections(schema, dsn, args):
+    # userid => fxa_kid
+    #           fxa_uid
+    # collection => collection_id
+    # last_modified => modified
+    db = conf_db(dsn)
+    cursor = db.cursor()
+    out_file = args.output.rsplit('.', 1)
+    out_file_name = "{}_user_collections.{}".format(
+        out_file[0], out_file[1]
+    )
+    writer = DataFileWriter(
+        open(out_file_name, "wb"), DatumWriter(), schema)
+    sql = """
+    SELECT userid, collection, last_modified from user_collections
+    """
+    start = time.time()
+    try:
+        cursor.execute(sql)
+        row = 0
+        for (user_id, collection_id, last_modified) in cursor:
+            (fxa_kid, fxa_uid) = get_fxa_id(user_id, args.anon)
+            try:
+                writer.append({
+                    "collection_id": collection_id,
+                    "fxa_kid": fxa_kid,
+                    "fxa_uid": fxa_uid,
+                    "modified": last_modified
+                })
+            except Exception as ex:
+                print("Could not write user_collection row: {}".format(ex))
+            row += 1
+        print(
+            "Dumped {} user_collection rows in {} seconds".format(
+                row, time.time() - start
+            ))
+    finally:
+        writer.close()
+        cursor.close()
+
+
+def dump_rows(bso_number, chunk_offset, db, writer, args):
+    # bso column mapping:
+    # id => bso_id
+    # collection => collection_id
+    # sortindex => sortindex
+    # modified => modified
+    # payload => payload
+    # payload_size => NONE
+    # ttl => expiry
+
+    ivre = re.compile(r'("IV": ?"[^"]+")')
+    print("Querying.... bso{} @{}".format(bso_number, chunk_offset))
+    sql = """
+    SELECT userid, collection, id,
+    ttl, modified, payload,
+    sortindex from bso{} LIMIT {} OFFSET {}""".format(
+        bso_number, args.limit, chunk_offset)
+    cursor = db.cursor()
+    user = None
+    row_count = 0
+    try:
+        cursor.execute(sql)
+        print("Dumping...")
+        for (userid, cid, bid, exp, mod, pay, si) in cursor:
+            if args.anon:
+                replacement = encode_bytes_b64(os.urandom(16))
+                pay = ivre.sub('"IV":"{}"'.format(replacement), pay)
+            if userid != user:
+                (fxa_kid, fxa_uid) = get_fxa_id(userid, args.anon)
+                user = userid
+            writer.append({
+                "collection_id": cid,
+                "fxa_kid": fxa_kid,
+                "fxa_uid": fxa_uid,
+                "bso_id": bid,
+                "expiry": exp,
+                "modified": mod,
+                "payload": pay,
+                "sortindex": si})
+            row_count += 1
+            if (chunk_offset + row_count) % 1000 == 0:
+                print("BSO:{} Row: {}".format(bso_number, chunk_offset + row_count))
+            if row_count >= MAX_ROWS:
+                break
+    except Exception as e:
+        print("Deadline hit at: {} ({})".format(
+            chunk_offset + row_count, e))
+    finally:
+        cursor.close()
+    return row_count
+
+
+def count_rows(db, bso_num=0):
+    cursor = db.cursor()
+    try:
+        cursor.execute("SELECT Count(*) from bso{}".format(bso_num))
+        return cursor.fetchone()[0]
+    finally:
+        cursor.close()
+
+
+def dump_data(bso_number, schema, dsn, args):
+    offset = args.offset or 0
+    total_rows = 0
+    # things time out around 1_500_000 rows.
+    db = conf_db(dsn)
+    out_file = args.output.rsplit('.', 1)
+    row_count = count_rows(db, bso_number)
+    for chunk in range(
+        max(1, math.trunc(math.ceil(row_count / MAX_ROWS)))):
+        print(
+            "Dumping {} rows from bso#{} into chunk {}".format(
+                row_count, bso_number, chunk))
+        out_file_name = "{}_{}_{}.{}".format(
+            out_file[0], bso_number, hex(chunk), out_file[1]
+        )
+        writer = DataFileWriter(
+            open(out_file_name, "wb"), DatumWriter(), schema)
+        rows = dump_rows(
+            bso_number=bso_number,
+            chunk_offset=offset,
+            db=db,
+            writer=writer,
+            args=args)
+        writer.close()
+        if rows == 0:
+            break
+        offset = offset + rows
+        total_rows += rows
+    return total_rows
+
+
+def main():
+    args = get_args()
+    rows = 0
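+    # strip trailing newlines and skip blank lines in the DSN file.
+    with open(args.dsns) as dsn_file:
+        dsns = [line.strip() for line in dsn_file if line.strip()]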
+    schema = avro.schema.parse(open(args.schema, "rb").read())
+    col_schema = avro.schema.parse(open(args.col_schema, "rb").read())
+    if args.token_file:
+        read_in_token_file(args.token_file)
+    start = time.time()
+    for dsn in dsns:
+        print("Starting: {}".format(dsn))
+        try:
+            if not args.skip_collections:
+                dump_user_collections(col_schema, dsn, args)
+            for bso_num in range(args.start_bso, args.end_bso+1):
+                rows += dump_data(bso_num, schema, dsn, args)
+        except Exception as ex:
+            print("Could not process {}: {}".format(dsn, ex))
+    print("Dumped: {} rows in {} seconds".format(rows, time.time() - start))
+
+
+if __name__ == "__main__":
+    main()
--- a/tools/user_migration/old/migrate_user.py
+++ b/tools/user_migration/old/migrate_user.py
@ -1,15 +1,18 @@
 #! venv/bin/python

-# painfully stupid script to check out dumping mysql databases to avro.
-# Avro is basically "JSON" for databases. It's not super complicated & it has
-# issues (one of which is that it requires Python2).
+# This file is historical.
+# This file will attempt to copy a user from an existing mysql database
+# to a spanner table. It requires access to the tokenserver db, which may
+# not be available in production environments.
 #
 #

 import argparse
 import logging
 import base64
+import json
 import sys
+import os
 import time
 from datetime import datetime

@ -17,10 +20,13 @@ from mysql import connector
 from mysql.connector.errors import IntegrityError
 from google.cloud import spanner
 from google.api_core.exceptions import AlreadyExists
-from urllib.parse import urlparse
+try:
+    from urllib.parse import urlparse
+except ImportError:
+    from urlparse import urlparse

 SPANNER_NODE_ID = 800
-
+META_GLOBAL_COLLECTION_ID = 6

 class BadDSNException(Exception):
    pass
@ -103,6 +109,11 @@ def get_args():
        action="store_true",
        help="silence logging"
    )
+    parser.add_argument(
+        "--full",
+        action="store_true",
+        help="force a full reconcile"
+    )
    return parser.parse_args()


@ -149,7 +160,8 @@ def update_token(databases, user):

    """
    if 'token' not in databases:
-        logging.warn("Skipping token update...")
+        logging.warning(
+            "Skipping token update for user {}...".format(user))
        return
    logging.info("Updating token server for user: {}".format(user))
    try:
@ -192,14 +204,15 @@ def get_fxa_id(databases, user):
    """
    sql = """
        SELECT
-            email, generation, keys_changed_at, client_state
+            email, generation, keys_changed_at, client_state, node
        FROM users
            WHERE uid = {uid}
    """.format(uid=user)
    try:
        cursor = databases.get('token', databases['mysql']).cursor()
        cursor.execute(sql)
-        (email, generation, keys_changed_at, client_state) = cursor.next()
+        (email, generation, keys_changed_at,
+         client_state, node) = cursor.next()
        fxa_uid = email.split('@')[0]
        fxa_kid = format_key_id(
            keys_changed_at or generation,
@ -207,11 +220,11 @@ def get_fxa_id(databases, user):
        )
    finally:
        cursor.close()
-    return (fxa_kid, fxa_uid)
+    return (fxa_kid, fxa_uid, node)


 def create_migration_table(database):
-    """create the syncstorage migration table
+    """create the syncstorage table

    This table tells the syncstorage server to return a 5xx for a
    given user. It's important that syncstorage NEVER returns a
@ -271,6 +284,47 @@ def mark_user(databases, user, state=MigrationState.IN_PROGRESS):
    return True


+def finish_user(databases, user):
+    """mark a user migration complete"""
+    # This is not wrapped into `start_user` so that I can reduce
+    # the number of db IO, since an upsert would just work instead
+    # of fail out with a dupe.
+    mysql = databases['mysql'].cursor()
+    try:
+        logging.info("Marking {} as finished...".format(user))
+        mysql.execute(
+            """
+            UPDATE
+                migration
+            SET
+                state = "finished"
+            WHERE
+                fxa_uid = %s
+            """,
+            (user,)
+        )
+        databases['mysql'].commit()
+    except IntegrityError:
+        return False
+    finally:
+        mysql.close()
+    return True
+
+def newSyncID():
+    # decode to text so the value can be serialized by `json.dumps`
+    return base64.urlsafe_b64encode(os.urandom(9)).decode('ascii')
+
+def alter_syncids(pay):
+    """Alter the syncIDs for the meta/global record, which will cause a sync
+    when the client reconnects.
+    """
+    payload = json.loads(pay)
+    payload['syncID'] = newSyncID()
+    for item in payload['engines']:
+        payload['engines'][item]['syncID'] = newSyncID()
+    return json.dumps(payload)
+
 def move_user(databases, user, args):
    """copy user info from original storage to new storage."""
    # bso column mapping:
@ -308,8 +362,8 @@ def move_user(databases, user, args):
    )

     # Generate the Spanner Keys we'll need.
-    (fxa_kid, fxa_uid) = get_fxa_id(databases, user)
-    if not mark_user(databases, fxa_uid):
+    (fxa_kid, fxa_uid, original_node) = get_fxa_id(databases, user)
+    if not start_user(databases, fxa_uid):
        logging.error("User {} already being migrated?".format(fxa_uid))
        return

@ -356,6 +410,8 @@ def move_user(databases, user, args):
                values=uc_values
            )
        # add the BSO values.
+        if args.full and collection_id == META_GLOBAL_COLLECTION_ID:
+            pay = alter_syncids(pay)
        bso_values = [[
                collection_id,
                fxa_kid,
@ -383,6 +439,16 @@ def move_user(databases, user, args):
        for (col, cid, bid, exp, mod, pay, sid) in mysql:
            databases['spanner'].run_in_transaction(spanner_transact)
            update_token(databases, user)
+            (ck_kid, ck_uid, ck_node) = get_fxa_id(databases, user)
+            if ck_node != original_node:
+                logging.error(
+                    ("User's Node Changed! Aborting! "
+                     "user:{}, fx_uid:{}, fx_kid:{}, node: {} => {}")
+                    .format(user, fxa_uid, fxa_kid,
+                            original_node, ck_node)
+                )
+                return
+            finish_user(databases, fxa_uid)
            count += 1
            # Closing the with automatically calls `batch.commit()`
        mark_user(user, MigrationState.COMPLETE)
--- a/tools/user_migration/old/requirements.txt
+++ b/tools/user_migration/old/requirements.txt
@ -0,0 +1,4 @@
+wheel
+avro-python3
+google-cloud-spanner
+mysql-connector
--- a/tools/user_migration/old/sync.avsc
+++ b/tools/user_migration/old/sync.avsc
@ -10,4 +10,4 @@
     {"name": "modified", "type": "long"},
     {"name": "payload", "type": "string"},
     {"name": "sortindex", "type": ["null", "long"]}
- ]}
+ ]}
--- a/tools/user_migration/requirements.txt
+++ b/tools/user_migration/requirements.txt
@ -1,4 +1,3 @@
 wheel
-avro
 google-cloud-spanner
-mysql-connector
+mysql-connector