Wondering if anyone here has some advise or a good place to learn about dealing with databases with Python. I know SQL fairly well for pulling data and simple updates, but running into potential performance issues the way I’ve been doing it. Here are 2 examples.

  1. Dealing with Pandas dataframes. I’m doing some reconciliation between a couple of different datasources. I do not have a primary key to work with. I have some very specific matching criteria to determine a match. The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I’ve looked into upsert (or inserts with clause to update with existing data), but pretty much all examples I’ve seen rely on primary keys (which I don’t have since the data has 4 columns I’m matching on).

  2. Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, line level detail, then third level with specific references assigned to the line level detail. As with a lot of transactional type databases there can be multiple references per line, multiple lines per header. I’m currently looping through the JSON file starting with the header information to create the primary key, then going to the line level detail to create a primary key for the line, but also include the foreign key for the header and also with the reference data. Before inserting I’m doing a lookup to see if the data already exists and then updating if it does or inserting a new record if it doesn’t. This works fine, but is slow taking several seconds for maybe 100 inserts in total. While not a big deal since it’s for a low volume of sales. I’d rather learn best practice and do this properly with commits/transactions vs inserting an updating each record individually within the ability to rollback should an error occur.

  • atzanteol@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    1 year ago

    It’s a little hard to say without seeing your datastructure. Is this something like

    { header: {
      id: 1,
      items: [ {
          name: "foo",
          field2: "bar"
       } ]
    }
    

    If you have something unique in the “header” you can create 2 tables with a dependency.

    create table header ( id number );
    create table item ( id number, header_id number, name varchar, field2 varchar);
    

    You can generate IDs for each item on-the-fly but won’t be able to tie the back to the JSON. BUT if you can tie back header to the JSON then you can do a “drop-and-replace” on the items with each run. Which may not be the most efficient but it will likely perform better than querying each row upon entry. e.g. (pseudocode)

    for each header in headers {
       delete from item where item.parent_id = header.id;
       for each item in header.items {
           insert into item values ( some_id, header.id, item.name, item.field2 );
       }
       commit;
    }
    

    But if you don’t want to drop/re-create then if there’ some combination of things in the “item” that is unique then you can use that as a compound key. In the worst case you can just use all the columns. I once created a primary key that was an MD5 checksum of the string value of all the fields in the row. It gave me a calculable primary key which was good and I could query off it easily. But it does make expanding the table much harder…

    The advantage of drop-and-replace will be that removed items in the JSON will also be removed in the database. Otherwise you’ll need to do some additional cleanup to find database entries that don’t have an entry in your JSON file(s).