Content provided by Molecular Coding. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Molecular Coding or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://he.player.fm/legal.
Molecular Coding covers the software, developers, and programming problems which go into the science and algorithms of computational chemistry. If you’re interested in the software that deals with life sciences at the atomic level - cheminformatics, molecular modeling, even bioinformatics - then listen here!
In this interview recorded 9 June 2011, I talked with Igor Filippov. He's the main author of OSRA, the Optical Structure Recognition Application. It extracts chemical structure information from printed material, and is used in a number of chemistry applications. Igor works at the NCI/CADD group of the Chemical Biology Laboratory at the National Cancer Institute. AD: Hello my name is Andrew Dalke. Welcome to episode four of "Molecular Coding." AD: In the summer of 2011 I went to the ICCS Conference in The Netherlands. That's the International Conference on Chemical Structures, which meets every three years. It was both enjoyable and informative. AD: Many people mentioned that they used OSRA to extract chemical structure information from printed material. The OSRA author, Igor Filippov, was also at ICCS. We got a few minutes during a break to talk more about OSRA. AD: This interview took place in Noordwijkerhout, The Netherlands on 9 June 2011. *music* AD: I'm here at ICCS with Igor Filippov. He's the main author of OSRA, the Optical ... IF: Structure Recognition Application. AD: It's the graphics program that's been used actually in a number of presentations here, plus of course you had your poster. Could you tell me a bit more about it? IF: OSRA is a project that started about four years ago in 2007. Basically it's a utility to convert images of chemical structures such as found in articles and patents into SMILES, SD file, or pretty much anything that Open Babel can produce, [like] InChI and InChIKeys. AD: How did you get started with it? Was it something your group started to work on or was it something that you were interested in? IF: It was just a hobby project at first. I got interested in it, then it got developed and people started using it and it got more useful to others. It's actually quite a nice feeling to have something fun to work on that is being used by other people too.
AD: When you say that it started out as a hobby project, you were doing this as a side part of work or was it just something totally different? - you saw chemicals and you thought "I would like to process that"? IF: Pretty much like that, yes. I heard about the general idea of software like that and I thought to myself "it's impossible to do this, it's too complicated." AD: That's what I think. IF: Then I tried it. I tried different vectorization algorithms, I tried OCR programs, and then I tried combining them together, because basically that's what it is. You take an image, you vectorize and get your bonds. You do OCR on the atomic labels and you've got your atoms. You combine them together and produce the molfile. AD: I saw in the list of programs that OSRA uses that you have a dozen or so different other tools. IF: I prefer not to reinvent the wheel and so I try to reuse what was done before me. OSRA has a very small codebase for such a project because it's basically a glue between various libraries that existed before. There are two different OCR engines for label recognition - JOCR and ocrad. There's a vectorization library - potrace - which does the raster to vector conversion. There's the graphics processing library - GraphicsMagick. There's Open Babel of course for generating the output molecular structure. There's a couple other additional technical libraries that OSRA is using. AD: You use the two OCR readers, which I'm surprised that you're using not one but two different ones, and that's to work out the text. What you added to this then was the recognition of what double bonds are and do you do chirality and wedges and all that sort of chemistry? IF: Yes. Well, I should say that none of those libraries were particularly developed for chemical structure recognition, so they're good at what they do but they're not 100% aligned with my project, so there's a lot of things I had to modify, pre-process, to make it fit.
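The pipeline Igor outlines above - vectorize the drawing to get bonds, run OCR to get atom labels, then combine the two into a molfile - can be sketched as a toy in Python. Everything here is an illustrative stand-in: none of these functions are OSRA's real internals, and the data structures are simplified placeholders for what potrace, the OCR engines, and Open Babel actually produce.

```python
# Toy sketch of the OSRA-style recognition pipeline Igor describes.
# All helpers are hypothetical stand-ins, not OSRA's actual API.

def vectorize(segments):
    # Stand-in for the potrace step: turn raw vector pieces into
    # bonds, each represented here as a pair of atom positions.
    return [tuple(sorted(s)) for s in segments]

def ocr_labels(labels):
    # Stand-in for the OCR step: map each position to its element label.
    return dict(labels)

def build_molecule(bonds, atoms):
    # Combine bonds and atom labels into a minimal connection table,
    # the role Open Babel's molecule object plays in the real pipeline.
    table = {pos: {"element": el, "neighbors": []} for pos, el in atoms.items()}
    for a, b in bonds:
        table[a]["neighbors"].append(b)
        table[b]["neighbors"].append(a)
    return table

# A two-atom "image": one C-O bond drawn between positions 0 and 1.
bonds = vectorize([(0, 1)])
atoms = ocr_labels([(0, "C"), (1, "O")])
mol = build_molecule(bonds, atoms)
print(mol[0]["element"], "-", mol[1]["element"])  # C - O
```

The point of the sketch is the separation of concerns Igor emphasizes: each stage is an existing tool's job, and OSRA's own code is mostly the glue that reconciles their outputs.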
For example, the output of potrace - the vectorization algorithm - it's not like you get bonds directly as a single vector. You have an assembly of vectors that will constitute a single bond, but you still have to recognize that vectors 1, 2, 3, 4 are actually a single bond and that vectors 5, 6, 7, 8 are actually part of a double bond somewhere else. Yeah, you have to do some pre- and post-processing on the output of this program. IF: Speaking of OCR, none of the existing OCR engines, especially open source OCR engines, are very good at recognizing single characters. Usually the focus is on recognizing whole text. There you can do a lot of things with dictionary-based corrections, where if you have a word partially recognized, the engine can correct itself, just having a huge vocabulary. You cannot do this, or not very easily do this, with single character atomic labels. I'm using two OCR engines because I feel that their combined strengths lead to a better recognition rate. As a matter of fact, optionally there are two more OCR engines you can compile in, so it can be up to four different OCR engines there. AD: Nice. AD: How does this handle superscript and subscript when you're doing isotope labeling or side groups? IF: Poorly. It does try to recognize subscripts, especially on Markush labels R1, R2 and so on. If the scan quality is good and the characters are sufficiently large then it can recognize it, but the smaller the character the less chance it will get recognized correctly. AD: It sounds like there are several different validation sets. There's various patent office data sets and things like that. How easy has it been to get the validation data that you need? IF: Not easy at all. Having the validation set is essential. Otherwise you cannot benchmark your performance; you cannot move forward if you don't know how well your version X is doing compared to X-1, for example.
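The multi-engine OCR idea Igor describes - no dictionary to correct against, so lean on agreement between independent engines - can be illustrated with a simple majority vote over single-character readings. This is only a sketch of the principle; the engines below are fake stand-ins, not JOCR or ocrad, and OSRA's actual combination logic may differ.

```python
# Illustrative sketch: with single-character atom labels there is no
# vocabulary to correct against, so a vote among independent OCR
# engines can beat trusting any one engine alone.
from collections import Counter

def vote(readings):
    # Pick the character most engines agree on.
    return Counter(readings).most_common(1)[0][0]

# Three engines read the same atom label; two say "N", one says "H".
print(vote(["N", "H", "N"]))  # N
```

With up to four engines compiled in, as Igor mentions, the odds that a majority misreads the same character the same way drop further, which is the intuition behind combining them.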
Initially I had a set of very diverse structures from the web, from articles, for which I had to draw the SD file by hand myself. Recently, with the help of John Kinney of DuPont and Steve Boyer from IBM, I acquired this huge data set from USPTO of 6,000 molecules. That was absolutely essential to get OSRA to produce better and better results. Originally it was a Complex Work Unit initiative at the USPTO. They have people who redraw the structures in a molecular editor and save molfiles, so you have both the image and the corresponding molfile. AD: When someone at DuPont, AstraZeneca - the people in the consortium - are working on their work, how much of what they do feeds back into what you do? You get test data from them. What about code changes and improvements in the algorithms? IF: There were some suggestions and recommendations, and maybe not directly code input, but John recommended some improvements in the algorithm to recognize tables in the text. If you have a table it will throw off the recognition algorithm because it's also a linear graphic and looks kind of similar to a molecule. John made some suggestions. He coded something for himself. While I didn't take his code directly, I was absolutely using his recommendations and improved my recognition engine. IF: Right now there is another guy who's working on the code itself, Dmitry Katsubo, from the European Patent Office. He has made a tremendous contribution, especially since he's more on the formal programming side. He did a completely new compilation system where you don't have to muck around with makefiles by hand. It's regular autotools-generated "configure; make; make install". AD: And that includes checking if all the tools exist and optionally compiling them in? IF: Yep. It's much easier now to produce [a program] out of source code. AD: I saw several people here using their iPods to take a picture of the screen during the presentation.
If somebody wanted to write a tool that was to sit inside the iPod, take a picture, process it ... has someone done that? IF: Yes. AD: How hard is it to do that? IF: It wasn't really my project. There's a company, I believe the name is Eidogen-Sertanty, they have a tool that works exactly like that. You take a picture with the iPhone and get a structure back which you can edit in their own editor on the iPhone or iPad. I believe OSRA is running on a remote server so the processing is not done on the phone itself. They load the image, process it, and get the SD file from the server. AD: Because then you would have to distribute all those different libraries; download them and add them on the iPod. IF: Well it's possible. I think the main problem is not to compile the code on iPhone or iPod. The problem is the performance is probably not quite there yet. It might take some seconds to process an image. My feeling is that it would be much more efficient to have the processing done on a big server than on a small iPhone. So far. Maybe next year they will have better processors on the iPhone. AD: I saw on your CHANGELOG that you spent some time now optimizing the code to make it faster. IF: Yes. And also Dmitry was very helpful in that. I think we've done quite a good job compared to a version of a couple of years ago. We improved the performance by a factor of three or four. Some of the main changes were code refactoring, making the code more lean and efficient. Also, I changed from ImageMagick to GraphicsMagick which is compatible but much faster. Most of the improvement came from this small change. AD: Where's most of the time being spent? IF: There are two factors where OSRA is taking its time. First of all, page segmentation - AD: Sorry, what is page segmentation? IF: Page segmentation.
If you have a document where you have text and molecular structures all mixed together on the same page, you somehow have to extract the structure out of the rest of the page because, for the OSRA project at least, you don't care about the text. We want to process only the structure. This process is called page segmentation. It's fairly time consuming. On the other hand I believe the OSRA algorithm is quite efficient, in that it can very often guess correctly that this is text and we are not interested in processing that, and so we are not spending time on blocks of text or some photographs. Often there are some pictures of mice, for example, in documents. AD: Do you ever get a mouse identified as a compound? IF: Sometimes. It happened quite frequently actually in the past. Now it's getting better I believe. AD: I was also seeing how the vendor sites mention that they work with OSRA. You support plugins for Symyx Draw, ChemBioDraw, and BKChem. How many people are actually using your code in the world? IF: I can only track the direct downloads. I guess we had one or two thousand downloads. We have two different distribution sites: SourceForge and our NCI/CADD web site, so you have to combine them. If somebody wants to use it and they don't tell me, I won't necessarily know that they're using it. AD: That's the up-side of open source - you're using so many open source projects. The downside is you don't get quite the same feedback from people using it. I like going to conferences like [ICCS]. People come up to me and say "oh, thank you for this project" or "I like using that tool." IF: Yes, exactly. It's a nice feeling to hear that it's useful. People sometimes email me with questions or say "hey, it's a good tool," they're using it. From places quite unexpected such as the International Union of Crystallography; there's a university in Australia where they're using OSRA. It's a good feeling to know that something you are working on is being useful for others.
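The page-segmentation step described above - deciding which regions of a page look like structures and which look like text or photographs, so only the structures get processed - can be sketched with a crude heuristic. The aspect-ratio rule below is an illustrative guess for the sake of the example, not OSRA's actual classifier.

```python
# Hedged sketch of the page-segmentation idea: filter page regions so
# that only structure-like blocks go on to recognition. The heuristic
# here (text lines are short and very wide, structures are closer to
# square) is illustrative only, not OSRA's real algorithm.

def looks_like_structure(width, height):
    # Aspect ratio of the bounding box: text lines score very high,
    # roughly square structure drawings score low.
    aspect = max(width, height) / max(1, min(width, height))
    return aspect < 4

# Three blocks found on a page, as (width, height): two text lines
# and one roughly square drawing.
regions = [(400, 20), (150, 160), (500, 15)]
candidates = [r for r in regions if looks_like_structure(*r)]
print(candidates)  # [(150, 160)]
```

A real segmenter also has to reject photographs (the "pictures of mice" Igor mentions), which this size-only heuristic would not catch; that is exactly why segmentation is the time-consuming part he describes.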
AD: What would be the best way to support the project? Would it be developers, or test data, or people with image processing experience? IF: All of the above. The test data is absolutely essential and it's very hard to produce. It's hard to validate hundreds - I'm not even talking about thousands - of structures by hand, and it's necessary to have it, from as many diverse sources as possible. We have good test sets from the USPTO and the Japanese patent office. It would be nice to have some similar test sets from WIPO, from EPO, from Chinese patents. AD: Have you downloaded the images that are in Wikipedia? They have the structure and a link to the PubChem id, and many of those structures in Wikipedia are actually drawn by hand. IF: I have not done this. AD: I just learned this a day or two ago about people doing stuff that way. Finding good images or finding correlations between, say, CAS id and SMILES by going through Wikipedia to look up PubChem to get the actual data they want. IF: I have not done this. This is an interesting idea, yes. AD: You're working at NCI/CADD. What do they do and how do they support your work? IF: Nowadays it's officially part of my responsibility. Before it was more a hobby project. It's not that I'm spending full time working on OSRA. I have other responsibilities there as well, and other projects. I'm working under the direction of Marc Nicklaus. He was very supportive. He was very appreciative of the project. AD: How is this project funded? IF: It's funded along with the rest of the CADD group by NCI. AD: You were telling me you came into this field as a physicist. How did you get involved in doing image recognition of structures? IF: I was doing my PhD at The Ohio State University. There I got acquainted with Jan Labanowski, the maintainer of the Computational Chemistry List (CCL). I was working on CCL helping him administer and maintain it for a while. That's how I got connected with the world of cheminformatics.
After graduation I joined Marc Nicklaus's group, and that's how I got myself into all of this area. AD: Were you doing software development since you were young? IF: Pretty much. In physics, for my PhD thesis, it was basically C++ and Mathematica, because I was building theoretical models and doing calculations. I've been involved with programming since I was 12-13. First it was BASIC and Pascal in school, then it was C and C++ and Perl. Now I'm interested in Python. It seems like a very interesting approach. Then a little bit of a lot of things. AD: Thank you very much for your time. It was interesting hearing more about OSRA. IF: Thank you. *music* AD: Thank you for listening to Molecular Coding. This podcast and transcript are distributed under the Creative Commons Attribution-Share-Alike 3.0 Unported license. The theme music was composed and performed by Andreas Steffen. I'm Andrew Dalke.…
In this interview recorded 28 March 2011, I talked with Bob Tolbert, the VP of Development at OpenEye, about the software engineering processes that they use to develop their cheminformatics and molecular modeling tools. AD: Hello my name is Andrew Dalke. Welcome to episode three of "Molecular Coding." AD: Shortly after OpenEye's CUP user group conference in 2011, I dropped by the OpenEye offices to interview Bob Tolbert. He's now the vice president of development at OpenEye. I first met Bob at a Python conference many years ago. He was working for Boehringer Ingelheim, and showed off that their email addresses were so long that they didn't fit on the front of their business cards. Back then he was an OpenEye customer who had written Python wrappers for OELib so he could develop his software in Python instead of C++. That influenced OpenEye to hire Bob and move him out to their headquarters in Santa Fe, where he's been since 2002. AD: My other two memories of meeting Bob are that he had served in the Navy on a nuclear sub, and that he enjoys talking. That last observation is still true, which made the interview enjoyable, but it also encouraged me to put off typing up the transcript for this podcast for a long time. AD: This interview took place in Santa Fe, New Mexico on 28 March 2011. *music* AD: Welcome to another edition of Molecular Coding. I'm here with Bob Tolbert at OpenEye. I wanted to talk with him about software engineering in cheminformatics, especially software engineering at a company that develops cheminformatics software. AD: Can you introduce yourself and talk a bit about what OpenEye does? BT: I'm Bob Tolbert. I'm the VP of Development at OpenEye. OpenEye writes tools for cheminformatics and molecular modelling. We sell both toolkits and applications. A big part of our business is the fact that we sell toolkits both across cheminformatics and modelling, etc. in C++, Python, Java and C#.
AD: One of the things I wanted to start off with was talking about toolkit development. The question I've had talking with other people is: how do you develop APIs? I mean, there's a dozen or so software libraries that exist [in cheminformatics]. Some of them are good and easy to use, some of them are kind of more complicated. How do you go about starting a new API, like when OEChem started or the depiction library? BT: I think the first thing that drives the API design is the core functionality, and our desire to avoid doing things the way they were done in the past. OEChem is, if you will, like a third generation toolkit. You can think of things that maybe even predated Daylight as the first generation, and OELib was a bit of a second generation, trying to use object-oriented [programming] but in a non-abstract way. A lot of the design of OEChem was done to prevent the sins of the past. To avoid things like leaky abstractions. AD: Do you actually go out and look at existing APIs and how they worked or ...? BT: What we did with OEChem was look at some existing stuff. You have to realize we started OEChem in 2000/2001. C++ and template support in compilers was relatively new, and there was a decision to do C++ and do it right. Again, some of the things we wanted to avoid were a molecule object which had too many member functions. It was kind of overwhelmingly large and bloated. It used templates not for template's sake but where they made sense, for things like predicates and iterators and other things. To remove as much as we could of things like internal implementation from the API. Because if you let those abstractions leak out and people start using back-doors, you're stuck and can't change. This was a fundamental flaw in OELib. BT: In fact the OEChem API went through I would say a year, year and a half of real argument, real let's-try-again, real throw-it-all-away-and-start-over, because it was a clean slate, to some extent, from the API point.
What's going to be in a molecule? What's going to be in an atom? Do molecules own atoms or do they just know about them? These kinds of things were all core decisions, and in the API we're going to use C++, we're not going to expose STL iterators and things like that. We're going to use our own iterators and make them work the way iterators should work. BT: Once you build that core then 10 years later [when] you want to design a new library, you can still use that same philosophy to design the next API. If I wanted to design a new API for depict, or a new API for a new library, you can go back to those design principles. There's enough people here now who are bought into those principles that you can get a group of people together and say 'Okay, let's talk about this new design' and it kind of works through argumentation and experimentation. Even some things we've done very recently; we did it all, we wrote it all up, we made it work, we used it for a while, we started writing examples and go "This doesn't feel like OEChem. It doesn't feel right." AD: Would the depiction library re-write be like that? BT: That's one of the things we've done. Because the very original API was something that was a bit put together out of other pieces, it never quite felt like other things. In the rewrite, what we were trying to do was manage completely replacing the underpinnings so the pictures are better and more flexible and can leverage a lot more stuff that we know how to do now. But not make the API so drastically different that people didn't know what to do. Then when we got it mostly done, we said 'well this is great. We fixed the pretty part. We can make better pictures, we can [???], but we've still got this kind of not-great API.' We're expecting people to learn something new, so they're going to learn something new anyway.
We might as well bring them all the way and say 'well they're probably an OEChem user - to use this they have to be - so why not push it further down the road of making it feel like it belongs as part of that group?' And further break the tie to the previous API. I think that's something that's still a work in progress but I feel like we're getting closer. AD: When you're doing the API development, you're saying that you build other tools based on the API for internal use, just to try it out - BT: I think you can't really know if an API is great until you write the documentation and you write the examples and you try to explain how to use it to somebody else. When you do that - before you even actually show it to somebody else, just the act of explaining it to yourself while you're writing that documentation, or writing that example - you find a hole, or a flaw, or a 'wow! I have to do these three things in the right order or it doesn't work; is that a good thing?' Or 'why do I keep cutting and pasting these eight lines into every example? Why is that not a single free function that does those eight things in the right order every time without me having to worry about it?' AD: That explains some of the evolution of the toolkit I've seen over time, such as the high-level functions for assigning aromaticity appropriate for SMILES. I used to go to the web site and grab those 'eight lines.' BT: There's a lot of that stuff. Depiction used to have this set of standard things that you had to do. It was boilerplate. You could copy it out of the example. Or you could just write a function that did it. A lot of users wrote that function and called it themselves. We just added it to the API. Now we have some very high level functions that do that. BT: You know, the important thing is to not only write the high-level functions, because of the stuff that people have.
They want to do something different, they don't want to do stereo that way, they don't want to depict this or that, they have some special case. The key is to write the low-level open enough that people can be pretty flexible, and then write the high-level for when you don't want to have to think, or you want to go and write the simple example. AD: How do you balance out the needs of the people who want low-level and performance, and the needs of the people who want high-level and not worry about it? BT: I'm not saying that high-level is not performant. I think that most of the high-level stuff is written to be performant. We don't write high-level stuff that does extra work just because it's easy. We don't really do that. AD: I can actually give an example of my teaching of the OpenEye code versus, say, the RDKit code and other libraries. OpenEye has the ability to take a molecule and then OEParseSmiles parses a SMILES into the molecule, whereas most every other library has "given a SMILES, make a molecule out of it." The OpenEye tools are much faster because they can just reset the molecule rather than parsing into it. I think it's more complicated there for me to teach people that you have to do this two-step process. But I can measure a 10% faster performance just because of doing that. BT: Yes, I think that's probably true. And there are other things where sometimes to protect people at the first level you have to re-initialize an object rather than re-use it because they don't clear it out. There are a number of things that do that. But I think we tend to write things in reasonably modular level where we decide 'this is a function that other people might need in the company', 'this is a function that only I need because it's an implementation detail and I'm writing this as a function just to break it out', versus 'this is a function that a customer would need to actually be able to do this.' We have three levels.
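The two API styles Andrew compares - parse-into-an-existing-object, as with OEParseSmiles, versus make-a-new-object-per-parse, as in most other libraries - can be sketched with a toy molecule class. The Molecule class and both parse functions below are hypothetical illustrations of the pattern, not OpenEye's or RDKit's actual APIs.

```python
# Toy sketch of the two parsing styles discussed above. The point of
# the parse-into style is that one object can be reset and refilled
# across a loop, avoiding a fresh allocation per SMILES string.

class Molecule:
    def __init__(self):
        self.atoms = []

    def clear(self):
        self.atoms.clear()

def parse_into(mol, smiles):
    # OEChem-style: reset the caller's molecule and fill it in place.
    # (The "parsing" here is a fake stand-in that just collects letters.)
    mol.clear()
    mol.atoms.extend(ch for ch in smiles if ch.isalpha())
    return True

def parse_new(smiles):
    # The style most other libraries expose: a fresh molecule per call.
    mol = Molecule()
    parse_into(mol, smiles)
    return mol

mol = Molecule()
for smi in ["CCO", "CO"]:
    parse_into(mol, smi)   # the same object is reused across the loop
print(mol.atoms)  # ['C', 'O']
```

The trade-off Bob names is visible even in the toy: reuse is faster in a tight loop, but it only stays safe because `parse_into` clears the object first; an API that trusts callers to do that themselves is the kind of foot-gun the "protect people at the first level" remark is about.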
We even have discussions about where different APIs go. Is this going to be a public API, is this going to be an implementation API, is it going to be a private API? BT: One of the advantages of a private API is not so much that we're hiding stuff from people - it's not hiding performance or anything else - it's that sometimes you don't know, right? Because this is the other example: Let's say that we come up with a new function or a new feature and we're not sure yet whether this is going to be public; it's going to be part of the toolkit. If you make it private, other people in the company, other products in the company, can start to consume it. You could have some kind of internal customer feedback. Then when you're [???], you can say 'oh, well, I'm using it but it doesn't work the way you think it does' or 'it doesn't work the way I need it to', you can refactor it internally first and then push it into the public API. Then you've got a better chance of having a public API that's stable, because people are not going to have to change. They won't be having to give us so much feedback [?incomprehensible?] the way it was, the way it used to be. BT: Particularly for depict, this is important too, this is such a big API change. We're not changing it, we're actually creating a new API. The old one is still going to be there for an iteration or two so that people's code will still build against the old API. The new one is all new classes with all new names. They can learn, they can port, and slowly we'll deprecate the old one. AD: You mentioned documentation when developing new APIs. How easy is it to tell people [to] document all the code they do, since that's not a very - software developers don't usually like writing documentation. BT: That's kind of like telling a kid to make their bed and brush their teeth. They know what's good for 'em but it's not always what they want to do. It is a battle.
I think that we are getting better, because we've gotten to a size now where most of us are consumers of somebody else's code. There was a time when we were small enough that we all wrote the code and consumed the code ourselves. At that point documentation was only extra work. But the minute you, as an internal customer, say 'I want to use this library that I didn't write' and you have a choice between digging through a header file, or opening up a Python wrapper in an editor, or going to the documentation, then you realize 'holy moly! I can go to the documentation. There's the function, there's an example usage - I can even cut and paste this code as example code and move on.' When you do that as a group, I think everyone's had enough of those experiences and they realize they've benefited from somebody else doing documentation, so it's a little easier to do it yourself. AD: If someone from OpenEye is working with the documentation, and finds that it's not well-written enough - they find they don't understand how to use it, for instance - what's the process for updating? Are the ones who discover the problem responsible for saying 'look, here's a thing', publishing it in the bug database, or getting ahold of the original developer and asking them to - BT: We track documentation bugs in the bug tracker just like we track toolkit bugs, or any other bug. If there's a thing there ... Sometimes people have a unique position. They may not be the person that wrote the library, but they're writing an example for another app. If it shows up something particularly interesting about it, well, they can go do it. They can and they will do that, because they have a unique view of this particular thing. It doesn't always necessarily have to be the person that wrote it that understands how to explain it. BT: I think as a group we've been pretty happy and pretty successful at doing, I think, a decent job. It's always getting better.
We're starting to get decent feedback from customers in recent years that notice that we've made it better, and of course any time you get positive feedback, that reinforcement makes it easier to go and write the next amount of documentation. We get to the point where now we have too much, and people are unhappy that it's too much, and you know you can't make people that happy. I think the other important thing is you have to find a format that is a very low barrier to people doing it. If there are technical problems with writing documentation, people don't want to learn. We used to do it in LaTeX, and we used LaTeX2HTML to generate the HTML, and went to PDF for the PDF version. That was great, if you knew LaTeX. If you didn't know LaTeX, it was bad. If you didn't want to learn LaTeX, it was really bad. I agree too, that you spend all day working on code and dealing with the C++ compiler, to have to then spend your documentation time dealing with the LaTeX compiler; it's pretty annoying, and so I kind of agree with that. But in most cases when people want to complain about these things, the easiest thing to say is 'okay, well find me a better solution. If it works then you can win; we'll do that instead.' AD: Are you still using LaTeX or have you switched to something else? BT: We've switched to the Sphinx stuff. In fact the previous iteration that used LaTeX was based off the then-standard way that Python did their documentation, which I always thought was pretty useful. When we went looking for a replacement for LaTeX, which wasn't just because of LaTeX. The LaTeX2HTML protocol was getting pretty long in the tooth and no one was using it any more and the results were kind of ugly by modern standards. I said 'well, let's go look to see what Python is doing.' This was a couple of years ago and Python had switched to Sphinx, and their documentation looked just that much better.
You go look at Sphinx and it's somebody who finally managed to take reStructuredText and make it usable and not yucky and almost fun to do. It's fun because the results are so awesome. They look great! So we went down that path. I think people are still not happy with having to learn a new language, but if you complain about learning reStructuredText then I don't know what else you expect to do except speech recognition - AD: It should recognize quill, ink, and paper. BT: Well, people probably wouldn't necessarily want to do that either. I'm not saying it's not a hard problem, but it is worthwhile, and we get enough good feedback from people that they recognize the documentation's useful. AD: What I find interesting about OpenEye is that there are people doing sales and support and there are developers, but you don't have, for instance, someone who specializes in documentation or someone who specializes in QA. I think that's pretty unusual for a software company. BT: It probably is unusual, and it won't last forever. At some point we will have a full-time QA person. As things are now, though, that would not be a very happy person. And I think at some point QA has to evolve. A single QA person seems like they'd be the most hated person around, and nobody would go to lunch with them. It probably would be short-lived, because I think all they would do is make everybody mad by telling them that stuff doesn't work. It's probably a misunderstanding on my part. But we will drive toward that, simply because at some point it's impossible for people to test some of the more complicated stuff themselves. It's like proof-reading your own resume - that's just a really bad idea. AD: Then you have the problem that of course all scientific software has: some of this stuff, how do you test it? If you're developing a new forcefield or new something - BT: Yes. There are objective tests and there are subjective tests, and "it works the way I think it ought to" versus "the way the world works." 
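For readers who haven't seen it, a Sphinx documentation page of the kind discussed here is just reStructuredText with a few directives. A minimal sketch, with made-up function and module names purely for illustration:

```rst
.. py:function:: count_rings(mol)

   Return the number of smallest rings in *mol*.

   :param mol: the molecule to analyze
   :returns: the ring count as an integer

   Example usage::

       from mytoolkit import count_rings

       n = count_rings(mol)
```

The indented `Example usage::` block is what makes the "cut and paste the example code and move on" workflow possible.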
Those are apples and oranges, right? If you know what you expect the thing to do then you can write tests and say "this is what I think the answer is." If you're using a force field, you're using an approximation. For example, one of the things that's new is that we calculate the entropy. It's a [???] rather drawn out process. It's not accurate to the nth degree like a quantum calculation, but it's totally usable; there are known limitations. So you can't say 'I'll go look up the entropy in some book and put that number in my test, and if I get it the test passes; if I don't, the test fails.' But what you can do - once you're finished with the science, once you've published the paper and established the kind of things we ought to get for test molecules - is turn that into a test to make sure, well, 'if I run it on Linux, and I run it on Windows, and I run it on 32-bit or 64-bit, or I call it from Python or Java, do I get the same number?' So I don't have extra errors from floating point roundoff or some other weird thing going on with the system. At least those kinds of tests catch inconsistency between the platforms. And six months from now, when someone goes and changes one of the moving parts and we suddenly start getting different answers because they changed the optimizer or some other low-lying code - or it could even be a change in OEChem that alters the chemistry model, which affects the force-field parameters and therefore gives different answers - well, you need a test to catch that. You don't want to be surprised that the entropy calculations are different for a reason that has nothing to do with that end of the spectrum. AD: Have you then developed a bunch of regression tests? Say, 'here are all these validation tests that, when you update a compiler or update the OS, you run through'? 
BT: We have tests in C++, but the lion's share are actually in Python, because that's the consumer API; it's the easiest way to write tests. Again, the lowest barrier for people who want to write a test is to be able to go and do it in Python. We also have them replicated in Java and C# - there aren't as many of those yet, I think - because you still need to test that when we wrap the C++ into Java it works the same way as when you wrap it in Python. BT: Think of the lifecycle of a software package that's shrink-wrapped and put on a shelf, versus the speed at which we come up with new science and try to put it out. The testing is done in parallel with the documentation, in parallel with the development, and in parallel with the science. Every bit of test code you write is code that's not writing the algorithm, so it's a balance. You have to make sure you do both. BT: I also think most everybody here - anybody older than a certain age, who didn't start out working with automated test suites - has to come to the realization that this is a good thing. They come to that because it saves their butt one day. They put in this test, or somebody else puts in this test, and then they run something and all of a sudden the test fails. They look at the test failure and go 'Oh wow! I didn't even have to debug this problem. I made an error, the test caught it, I go back to my code, and I fix it. I didn't have to spend 5 minutes, or 5 hours, or 5 days in the debugger trying to find this weird crash, because the test picked it up.' 
AD: I've had difficulty when I do API development: if I try to write the tests too early - I've had this argument with people who do Test Driven Development, where they write the tests first - I don't know what the API is going to look like. The tests work with the API, so when I change the API, my desire to have full, comprehensive tests at the same time I'm doing the API development makes things very complicated for me. [Note from the transcriptionist: that paragraph came out rather poorly! -- AD] So how early do you write the tests? BT: We write examples and tests as all-up C++ programs, so they're not built into a big automated test suite; they don't test every stinking little API point in some unit test way, because that does just [bind you into?] time later. But you certainly can write a functional test early that says, you know, if I give it this molecule, what do I get for the entropy? Those first tests are the things you can do from the beginning, and they shouldn't be too much effort to change if you decide the API needs to [bend?]. The other thing to realize is that if you make big complicated object APIs, you're buying into your own trouble. If you keep the objects simple, if you put a lot of your functionality into free functions, then your first API is really small. That actually works. Even if you did go write every test, you're not writing 100. You're maybe writing 10. You haven't bought the extra work until you need it. AD: By 'heavy' you mean, for example, the OELib style where you had 'isMethylHydrogen', 'isOxygen' [and] all these other methods on the atoms? BT: Right. There are two problems with that. It's not extensible. If you want to add a new method, now you have to go and change the atom API to say it's now this kind of atom. You can decide at some point 'that's silly', so you can do what we do, which is use predicates - free functions that operate on the atom. 
Those are infinitely extensible; it's easy to add another one. The worst case is when you add a bunch of stuff to the atom API and then decide it's gotten too big, so now you start having predicates, and now when people want to do something they have to decide 'is this a member function or is this a free function? Where do I look?' BT: Now, it caused some early confusion for OEChem users. People are used to typing 'mol.' or 'atom.' in an IDE and having it show every method that you could ever apply to an atom or a molecule. That's great - until that list is 300 things long and you can't scroll through it in Eclipse or Visual Studio in any rational way. The flip side is to have no members on the atom, and then you have to ask 'what methods work on an atom? Do I have to look through the whole API?' I get that. That's why we don't give people a big list, and that's why a lot of the OEChem manual is broken out into functional areas: "I want to do MCS", "I want to do substructure search", "I want to do reactions", "I want to do whatever". You go look in those sections and you find the functions focused on that task, not just every function that operates on an atom. AD: One last question about testing. How much test coverage do you have? How much of your API is covered? BT: I don't know. In Python, certainly for the older toolkits, it's pretty big. They touch a remarkably large part of the API. The newer toolkits are obviously further behind, and in Java and C# they are further behind as well. But I don't actually know the number. AD: OpenEye also develops some applications, both command-line tools and graphical applications. I'm curious: when you talk about APIs, as a programmer I go 'I understand how to do that.' When you switch to user interfaces and GUIs, how do you test that? I have no good answer for that. Do you? BT: I don't think the world has a really good answer for that. 
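The predicate design discussed above can be shown in a few lines. This is a sketch of the idea, not OEChem's actual API; the class and predicate names here are made up:

```python
class Atom:
    """Deliberately minimal: the object carries data, not a query
    method for every question you might ever ask of an atom."""
    def __init__(self, atomic_num, charge=0):
        self.atomic_num = atomic_num
        self.charge = charge

# Queries live outside the class as free-function predicates, so adding
# a new one never requires changing the Atom API.
def is_oxygen(atom):
    return atom.atomic_num == 8

def is_anion(atom):
    return atom.charge < 0

def has_atomic_num(n):
    # Predicate factory: users can mint new predicates on the fly.
    return lambda atom: atom.atomic_num == n

atoms = [Atom(6), Atom(8), Atom(8, charge=-1), Atom(1)]
oxygens = [a for a in atoms if is_oxygen(a)]
```

The trade-off is exactly the one described: `atom.` completion no longer lists every query, so the documentation has to group the predicates by task instead.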
I think that's one of the things that's going to drive us to have more full-time QA people; people whose job it is to figure out how to break it - which you're particularly good at, as a hobby - AD: Thank you. BT: - which is different from the developer, who's going to go through a standard path because he's trying to make sure something works. It's got to be somebody different from the developer, when you get to GUIs. BT: We use Qt for GUI development. There are a decent number of testing frameworks for Qt. And when you add in that we have the scripting language underneath, we can actually do a lot of testing of our GUI apps by writing Python tests that run inside the interpreter and call the same functions the GUI parts do. You can do some stress testing and some big-picture stuff running things inside that Python interpreter. [If] you don't have that, it is a lot harder. BT: From the command line, you could go crazy and write stuff that calls every argument with every other argument with random stuff to find things. We don't tend to spend a lot of time chasing those problems. Mostly we write functional tests with a reasonable set of commands that have expected input and expected output, and make sure that we get the same results. One of the things we've spent a lot of time on in the last few months - it's part of the ongoing documentation toolkit effort - is rewriting all the examples, in C++, in Python, in Java, in C#: simplifying them, focusing them on specific tasks. Then we have tests which run all four of them with the exact same inputs and check that they give the exact same output across the entire suite of things. So not only does C++ continually reproduce the same answers, but if I run the exact same example in Python, do I get the same answers I got in C++? 
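The command-line functional-test pattern described above, a known command with expected input and expected output, can be sketched with the standard library. The "tool" below is a stand-in one-liner run through the Python interpreter, since the real command-line apps aren't available here:

```python
import subprocess
import sys

def run_tool(args, stdin_text=""):
    """Run a command-line tool, capturing exit code and stdout."""
    result = subprocess.run(args, input=stdin_text,
                            capture_output=True, text=True)
    return result.returncode, result.stdout

def check_functional(args, stdin_text, expected_stdout):
    """A functional test: a reasonable command with known input must
    produce the expected output, byte for byte."""
    code, out = run_tool(args, stdin_text)
    assert code == 0, f"tool failed with exit code {code}"
    assert out == expected_stdout, f"got {out!r}, expected {expected_stdout!r}"

# Stand-in for a real app: a tiny "sort stdin lines" command.
tool = [sys.executable, "-c",
        "import sys; sys.stdout.write(''.join(sorted(sys.stdin)))"]
check_functional(tool, "b\na\nc\n", "a\nb\nc\n")
```

Running the same `check_functional` call against the C++, Python, Java, and C# versions of an example is what gives the cross-language consistency check described above.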
AD: Right, because you have the combinatorial problem of different operating systems, different compilers on those operating systems, and different languages on top of those. How do you test? You support, what, a dozen-plus different architectures? BT: Actually the world is simplifying that for us, but yes. The tests have got to be automatable, they've got to be in the tree, and there's got to be a way to type 'make test' while you're working. AD: Do you have all the machines here that you can test them on? BT: We don't support any architecture for which we don't have a machine running 24/7. AD: A real machine or a virtual machine? BT: We use virtual machines for some of them, simply because it makes good sense to use virtual machines. Our machine room used to look like the shelves at Best Buy. Every one was different, and those were a pain to keep track of. We've switched in some cases to virtual machines for build machines. That doesn't work too well when you want to build build machines and test machines for GUI stuff, so for some of the newer, more popular, more common Linuxes we have real desktops with 3D video cards that are used for testing GUIs. But the back-end stuff - all the other variations - those are all machines that you can get to. BT: They're not the kind of thing where you say 'we're going to make this machine dual-boot between, you know, RedHat 5 and SuSE 10', and then when it comes time to run a test, one guy wants to test on one platform and somebody else wants the other. We've had that in the past. It just doesn't work. So we want to have every platform available all the time to every developer to go and run the tests on. AD: Who then does the system administration of all those different architectures? BT: We do. It's not that bad. There are one or two of us that can do most of it. I think system administration is one of those things where you have to ask 'what are our choices?' We could hire a full-time sysadmin. 
It wouldn't be a programmer, it wouldn't be a scientist, it wouldn't be one of "us", and he would hate us and we would hate him or her. It would be the adversarial relationship that exists in most other places where you have a full-time sysadmin trying to baby those machines. Since all of us would hate that - we're not at the point where we're willing to give up control and give up the ability to know what's going on; we're not there yet - we've made a kind of pact and said 'this is the lesser of two evils. Yes, we have to do the work, but yes, we have control over what gets done.' I think that by spreading that out - we've spread it out over 4 or 5 people - it's actually worked reasonably well. AD: Even though you have to have people that are more jack-of-all-trades; doing some documentation, doing some system administration, doing some programming - BT: If you hire good people you can do that. You can't hire just anybody. But if you hire good people they can do all the pieces. And I think there's an advantage to doing all the pieces, because then you appreciate the whole process in a different way. AD: Most people who come here come from chemistry, so there - BT: That's not necessarily true! We have a good mixture of folks with computer science backgrounds as well as chemistry backgrounds. If they come in more as the science-side developers, then yes, they probably have a PhD and did some coding in their degree and their postdoc. They have to get up to speed in C++. Maybe they were Fortran guys before. Maybe they were C coders before. They do have a pretty good uphill battle to get up to speed on C++ and up to speed on everything. But that doesn't mean they can't do it or don't want to do it. They have done quite well. Then we've hired people on the other end. They don't know chemistry, but they are GUI programmers or graphics programmers; those kinds of folks. They have a computer science background. 
They don't have a problem picking up the algorithms, they don't have a problem dealing with the tools, they don't say "CVS, how do you spell that?". It's not a problem. The balance of all that has worked really well. BT: You're not going to hire somebody who doesn't know anything about cheminformatics or graph theory - someone who worked as a quantum mechanics person and wrote Fortran - and say tomorrow, "Okay, you're now the lead developer on OEChem, responsible for all the wrappers." That's not going to happen. That's silly. But we have a wide spread of stuff that we do. A lot of it is physics on one end, and a lot is graph theory and cheminformatics on the other, and we tend to put people on the pieces where that's what they do. Then they grow into the other areas when they've found things they're interested in. AD: How do you structure the software [development]? Say, cheminformatics. People are doing OEChem and depictions and things like that. Do you have a group around that, or is it more fuzzy, where some people work 80% of the time on this and 20% on that? BT: It's pretty fuzzy. We have more products than people, so we have multiple products per person instead of multiple people per product. That's just the nature of our business model and us. We don't have a lot of hierarchy when it comes to groups within groups and group leaders. It's mostly a bunch of programmers, and most everybody knows what they're responsible for. Everybody knows what part they own. Everybody has the freedom to go digging around in other places if they have a problem. The advantage of everybody seeing everything is that somebody in Boston, who's two hours ahead of us, can have a problem at 8am Boston time and not have to say 'well, I guess I'll have to wait until somebody in Santa Fe gets to work and can go look at this problem in OEChem.' They can just get to work. 
If it's a core algorithmic thing, they're not going to just chuck it in without checking, but they're capable and totally able and empowered to go say 'oh, well here's the problem.' The person here who's ultimately responsible will just come to work and find an email in their inbox saying 'here's what I did, here's the crash', or the bug or what doesn't work, 'and here's what I think is the fix. Is it okay if I check this in?' AD: Did this sort of corporate culture start when OpenEye first started, or was it something you had to work at? BT: I think it's been here the whole time. Well, in the very early days it was one product, one person. Each person who came to OpenEye came to do that thing. Joe Corkery came to do VIDA, Mark McGann came to do FRED, Matt came originally to do OMEGA. Each person managed their own piece. That lasted for a few years, but at some point the number of pieces outgrew the number of people, and the new ideas didn't require us to hire somebody new; we had the ideas in here, we needed to go do them. Then we started coming up with 'these are the things we want to do now and what we want to do in the future; what kind of people do we need to bring in to do it?' But I think everybody being on the same page and part of the same team is driven a lot by culture. It's also driven a lot by the fact that we manage everything together. We don't have a separate git repository for every little piece, that one person owns and no one else has any write access to, where you have to go cherry-picking. You have access to everything in one spot. AD: If everyone manages everything, how do you manage the relationship with the customers? Is everyone free to talk with the customers to get the information they want, or do you try to put that through one person? BT: We're pretty open to talking with customers. 
You have to balance letting people talk to customers against the point where it becomes a distraction that prevents them from getting work done. That's almost never the case. I think everybody here knows what they know and what they don't know. I really don't have any fear that somebody is going to go talk to a customer, tell them something that's completely wrong, and set a bad precedent. I think that's just not a problem. Most people know what they know and what parts of the code they understand, and know who to ask if they get a customer request for something that's different. AD: How much of the direction of OpenEye is based on direct customer feedback, versus where you all think the science will go in the future? BT: I think that the science direction is driven a lot by Ant, and his desires and his ideas, and what he sees as the future for the industry and for us. - AD: So no molecular dynamics - BT: Well, there are a lot of things. Until they have a proven place in our portfolio, we're not going to do them. But you know, if you'd asked anybody in this company five years ago whether we'd have a fingerprint toolkit, or all the 2D we have, the answer would have been a resounding "no; are you kidding?". We now have a significant investment in 2D. Not because of anything other than the fact that we've now come up with some reasonable ways to put it to work alongside 3D; not instead of 3D. Now it's an important player, part of our toolset, and not just something we did because somebody said 'can you guys do this?' We have a use for it. BT: That's the important thing. Because we have a close relationship with a lot of customers, it's not that we go looking for questions. It's not that we say 'we have this idea for a new product, what do you think?' 
It's almost that we have a constant back-and-forth from user-group meetings and other things, where we come up with an idea and then realize "oh, I talked with so-and-so at last year's CUP about that very same thing. They asked me, I came to the same conclusion; now maybe this is something we should do." That tends to happen an awful lot. AD: Is CUP the most important way to keep in touch with the users; [to know] what the future is going to be for the users? BT: I think it is. BT: We do get decent feedback via support, but some people are funny about that. Some people love writing support and we hear from them regularly. Some people don't ever want to ask; they want to figure it out themselves, and suffer in silence. We don't always know. We do spend a lot of time on the road; a lot of time visiting customers. Those are also opportunities for people to say "I had this problem three months ago and I never could make it work." The first question is usually "did you write support?" and the answer usually is "Ahh, I didn't have time for it." We understand that customers are busy. If something impedes their workflow right then, they're really good at working around it: using another tool, doing another job, or just going to something else on their to-do list. Spending time writing up a bug report and sending it in is usually not high on their list. I totally understand that, because we're probably all guilty of that with some other product, some other person's tool that we use. Nonetheless, I think it is an important avenue for us to hear from people when things don't work. We do occasionally hear from people saying how well something works, which is nice. CUP gives us more of those opportunities, because the customers can talk. You see a customer give a talk, and he uses all your tools, and you think "holy moly - this is why I work here!" 
Because they took a tool that I wrote and turned it into something I never thought about and got some great utility in turn. One of the things we find, which is an important part of the toolkit perspective, is that - it depends on the company - if you are a turn-the-crank guy, in other words you buy this piece of software, pour stuff in, turn the crank, and results come out, at some point, in some big organizations - this is a corporate culture thing at our customer sites - the boss can say "what value do you bring to this equation? You're just turning the crank. I can hire a cheap person to turn the crank." Toolkits give people the ability to customize things in a way that a canned-up application doesn't. By customizing things - coming up with a new workflow, coming up with a new algorithm, combining multiple algorithms in a way that no one has really done before - people can show why they are part of the equation at the customer site, and show how they add value to the actual workflow in the company, differently from just running a turn-the-crank application. Some people really like that. For some people it's a control thing, or because they've always been a software developer, or a cheminformatics [thing?], but it also means that at the end of the year you can say 'I did this. I built this thing no one else has. It solves particular problems, or it found particular leads, or it did something that I couldn't have done otherwise.' That doesn't hurt. The customers are happy because they are progressing and advancing in their own company. That's a good thing too. AD: I'm going to end with: how did you get into the field? You started off working on a nuclear sub, then came into chemistry and software development. How did you end up in Santa Fe, New Mexico? BT: You don't have enough time to get into the whole thing. 
It's a very weird path: getting out of the Navy, following an old professor who was in Idaho, realizing I was a danger to myself and my fellow man if I worked in a lab, switching to theoretical chemistry, landing in a pharma company doing this stuff because somebody had to do it, and growing into cheminformatics and C++. It's a weird thing. I was a customer of OpenEye before I came here. I don't really know to this day how I managed to walk this weird path. AD: I remember you did the Python bindings for OpenBabel - BT: - OELib - AD: OELib, right, which became OpenBabel. BT: I was a very early Python programmer because I was an even earlier Perl programmer. In grad school, and again in my first job, I wrote a Perl program that I couldn't read six months later. That was the last Perl I ever wrote. I switched to Python. At the time there was no OELib; there was nothing else. I had to write my own everything. I don't think we even had Daylight at that company at the time; we didn't have access to anything. So I wrote my own little set of tools to do some stuff that needed to get done. After a while I got tired of that and wanted to use somebody else's tool. It kind of grew from there, but I knew I was going to be doing it in Python no matter what else mattered. Really, I think the Python stuff is what got me here; it's not what kept me here. AD: So Python and doing chemistry is what took you all the way to Santa Fe, New Mexico. BT: Yeah, well, I guess that's true. AD: And maybe the scenery too? BT: Yeah. Yeah. AD: Alright, well thanks for the interview, and thank you all for listening to another edition of Molecular Coding. AD: Cue music. *music* AD: Thank you for listening to Molecular Coding. This podcast and transcript are distributed under the Creative Commons Attribution-Share-Alike 3.0 Unported license. The theme music was composed and performed by Andreas Steffen. I'm Andrew Dalke. 
BT: The only reason I say this is one of the podcasts I listen to, the guy always promises to edit stuff out, and when he says that after you've just heard it, I realize: he didn't actually do it. He didn't even edit out him admitting that "oh yeah, we'll edit it out."…
In this episode, recorded 7 June 2011 at ICCS 2011, I interviewed Paolo Tosco about the Open3DQSAR project. It is a tool for high-throughput chemometric analysis of molecular interaction fields. As of 17 June 2011 this project became free software; before then it was restricted to non-Europeans because the CoMFA patent for Europe did not expire until that date. I spent the last 4 hours transcribing the interview. Please let me know if you find it useful. AD: Hello, my name is Andrew Dalke. Welcome to episode two of "Molecular Coding." A few weeks ago I was at the International Conference on Chemical Structures in The Netherlands. During lunchtime I met Paolo Tosco. He was talking with others at the table about how the CoMFA patent meant that his 3D-QSAR package was available at no cost to most people in the world, but it couldn't be released as free software until the European patent expired on 17 June 2011. AD: I asked him more about his project, and I was impressed both by what he's done and by the amount of work he puts into supporting the people who might use his software. He very kindly gave me a crash course in 3D-QSAR so I would know enough to interview him about his project. The interview took place on the 7th of June, 2011. AD: If you want to try out his software, Open3DQSAR.org is for the tool that does high-throughput chemometric analysis of molecular interaction fields, its sister site Open3DAlign.org does unsupervised molecular alignment, and sdf2xyz2sdf.sourceforge.net does atom type assignment for the Merck molecular forcefield implementation in Tinker. *music* AD: Welcome again to Molecular Coding. I'm here with Paolo Tosco, who has written a 3D-QSAR program. The software is Open3DQSAR; it uses OpenBabel and it's an open source project, or almost open source. I thought I would talk about what is going on [and] more about the software components that go into it. 
PT: Yes, officially Open3DQSAR will be fully open source in a few days, because the infamous CoMFA patent will expire on the 18th of June, so I will be able to freely distribute the software under the GPLv3 license. Until now it was restricted to non-European countries, because the patent had already expired in the USA, so I could distribute everywhere but in like twelve European countries. It will be truly open source. AD: Okay, and since I didn't know much about 3D-QSAR, I asked Paolo to walk me through the steps, so I think we'll just do that now. How do we use your tools and do science? PT: Basically 3D-QSAR is a way to derive a correlation between the activities of a set of molecules and their 3D properties, usually in terms of steric and electrostatic potential fields. You try to spot the zones of each compound which drive better or worse affinity for your target. This technique is especially useful when you don't know very much about the target, or maybe you don't even know what the target is. So the only experimental information you can use is the information which comes from ligands. AD: So you start off with what, an SD file that contains 'these are active' and 'these others...' PT: Yes. You can import an SD file, which of course contains the 3D structures, and along with the 3D structures there can be a field containing an affinity or pIC50 or whatever. If you don't have an SD file which encodes the affinity, you can import an external text file [with] the affinities, provided of course that they are in the correct order and match the molecules in the SD file; but this is obvious. AD: I know the next step in this was to do the conformation generation. I was looking through the mailing list, and it looked like the conformation generation wasn't even added until sometime last fall. 
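As a sketch of that first step, an SD file whose data fields carry the affinities, here is a minimal parser for pulling one tag per record. A real workflow would use a cheminformatics toolkit; the field name `pIC50` is just an example:

```python
def read_sd_activities(sdf_text, field="pIC50"):
    """Extract one data field per record from SD-file text.

    SD records end with '$$$$'; each data item is a '>  <tag>' header
    line followed by its value line(s).
    """
    activities = []
    for record in sdf_text.split("$$$$"):
        lines = record.splitlines()
        for i, line in enumerate(lines):
            if line.startswith(">") and f"<{field}>" in line:
                activities.append(float(lines[i + 1].strip()))
                break
    return activities

# A tiny two-record SD-file fragment (molecule blocks elided):
example = """mol1


> <pIC50>
7.2

$$$$
mol2


> <pIC50>
5.9

$$$$
"""
```

The "correct order" caveat in the interview is visible here: the returned list is meaningful only because it is parallel to the molecules in the file.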
PT: Conformational generation is a big problem, as everyone knows. It's not very difficult as long as you only take open-chain structures into consideration, but it becomes a pain when you have to sample, for instance, ring systems. It's definitely not easy to cut a ring, sample the conformation, and then rebuild the ring. And it's not easy to beat a good Monte Carlo algorithm, for instance. The way I decided to implement conformational searching was using quenched molecular dynamics, that is, molecular dynamics at high temperature. The temperature should be high enough to overcome the torsional energy barriers, but not so high that it, for instance, enantiomerizes chiral centers or isomerizes double bonds or those kinds of things that you don't usually want to happen. AD: Most people who want to do conformational search might use, say, Corina or the OpenEye tools. You decided to go ahead and implement your own conformation generation tool. How did you go about doing that, and how good a replacement is it... I guess, for the purposes of doing 3D-QSAR? PT: The point is that I wanted the whole suite not to include any kind of closed-source or non-free program, so I had to rely on either freely available tools or something I coded on my own. The thing is, there are not too many; basically there is no Merck-forcefield-based tool on the market which is completely free. My only choices were OpenBabel and Tinker. While OpenBabel can accomplish basic energy calculations using the Merck forcefield, Tinker of course offers a much larger palette of possible calculations, including implicit solvent models, which proved to be very handy when you're dealing with conformations of molecules in solution. My choice fell on Tinker. 
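The quenched-dynamics scheme described here, run the system hot enough to hop torsional barriers, then minimize snapshots down to their local minima, can be illustrated on a toy one-dimensional torsional potential. This is only a cartoon of the idea (Metropolis steps standing in for real high-temperature dynamics), not Open3DQSAR's implementation:

```python
import math
import random

R = 0.0019872  # gas constant, kcal/(mol*K)

def torsion_energy(phi_deg):
    """Toy 3-fold torsional potential (kcal/mol); minima near 60/180/300."""
    return 2.0 * (1.0 + math.cos(math.radians(3.0 * phi_deg)))

def minimize(phi_deg, step=0.5, iters=5000):
    """Crude numerical gradient descent: the 'quench' stage."""
    for _ in range(iters):
        grad = (torsion_energy(phi_deg + 1e-3)
                - torsion_energy(phi_deg - 1e-3)) / 2e-3
        phi_deg -= step * grad
    return phi_deg % 360.0

def quenched_search(n_snapshots=20, high_temp=1200.0, seed=0):
    """Hot sampling crosses the ~4 kcal/mol barriers; each snapshot is
    then quenched and its local minimum collected."""
    rng = random.Random(seed)
    phi = 0.0
    minima = set()
    for _ in range(n_snapshots):
        for _ in range(200):  # stand-in for high-temperature dynamics
            trial = phi + rng.uniform(-30.0, 30.0)
            d_e = torsion_energy(trial) - torsion_energy(phi)
            if d_e < 0 or rng.random() < math.exp(-d_e / (R * high_temp)):
                phi = trial
        minima.add(round(minimize(phi)))
    return sorted(minima)
```

The temperature choice mirrors the interview: at 1200 K the Boltzmann factor for a 4 kcal/mol torsional barrier is large enough to cross it regularly, while processes with much higher barriers (bond isomerization, inversion of stereocenters) would stay frozen.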
The Merck forcefield was implemented in Tinker a few years ago, but a big part of this implementation was missing, because basically there was no tool to do automatic atom typing of the input coordinate files. So the first step was to code something able to do the atom typing. Since I didn't want to reinvent the wheel from scratch, I relied on code which already existed and was already GPL'ed, and that's the OpenBabel code. I fixed a few bugs which were present in the Merck forcefield assignment code, and I added the part which converts the 99 MMFF atom types into the 214 Tinker MMFF-like atom types. AD: I've never actually implemented MMFF, but that defines the atom types. So an SD file comes in, you fixed the OpenBabel code so it does the correct MMFF atom type assignment, and then you had to translate the MMFF type assignment into the Tinker types. PT: Exactly. The Tinker atom types enrich the plain Merck forcefield atom types, because they add types which encode part of the charge inside the atom type itself. It's a richer description than the original Merck forcefield atom types. I talked with Professor Ponder about this, and he was interested as well in a tool able to produce an XYZ file containing Merck forcefield atom types in the Tinker format. That's why I also released, as a side dish, a stand-alone tool which can convert an SD file, either single- or multi-molecule, into one or a set of XYZ files ready for calculations inside Tinker. You can find this other tool, which has a cumbersome name. I'm still asking myself why I used such a ... *laughs* AD: That's sdf2xyz2sdf. PT: Yes. You will win a prize for being able to spell it without ... *laughs* AD: It is a mouthful. PT: Yes. And this is just a stand-alone tool to accomplish atom typing. AD: So that doesn't even use OpenBabel at all? PT: It does.
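The 99-to-214 expansion Paolo describes amounts to a lookup that folds extra context (here, formal charge) into the type. A sketch of that two-stage idea, with all type numbers invented as placeholders (they are not the real MMFF94 or Tinker tables):

```python
# Hypothetical sketch: a perceived MMFF type plus the local formal charge
# maps onto a richer Tinker-style type. All numbers are placeholders.

MMFF_TO_TINKER = {
    # (mmff_type, formal_charge) -> tinker_type
    (32, -1): 201,   # e.g. an anionic oxygen environment
    (32,  0): 202,   # same MMFF type, neutral environment
    (34,  1): 203,   # e.g. a cationic nitrogen environment
}

def tinker_type(mmff_type, formal_charge):
    try:
        return MMFF_TO_TINKER[(mmff_type, formal_charge)]
    except KeyError:
        raise ValueError(f"no Tinker type for MMFF {mmff_type}, q={formal_charge}")

print(tinker_type(32, -1))  # → 201
```

This shows why the Tinker typing is "richer": one MMFF type can split into several Tinker types depending on charge state.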
To make things easier I did not directly link to the libraries; what the tool actually does is call the real binaries, so you bypass many of the problems you can face, like having an old library on your system. The tool includes the binaries for every operating system, so you can use it quite comfortably and you don't have to care about whether you have an up-to-date version on your own system. AD: I'm also curious about how it was to integrate with Tinker, because one thing I know is that it's written in Fortran, and the other thing I know is that most of the time when you use Tinker you have to get it yourself ... It's not freely available. I mean, it's available at no cost, but it's not under a GPL or BSD license. PT: That was also my concern, because of course you can always use an external tool, but then all users have the pain of getting the tool on their own, getting the license, getting the binaries or compiling them for their own architecture, and so on and so on. That is, in my opinion, already quite discouraging for users of your software. If you are able to put together a full-featured package which has all the pieces it needs to work, I think it's better for everyone. That's why I asked Professor Ponder whether I could include their binaries in my distribution, even though they are not strictly open source. He was positive about that: given the open source nature of my project and the non-profit nature of the whole thing, he was very eager to allow me to do this. He was really very collaborative and very interested in the whole thing. I think this was a very good point in the whole story. AD: From what you were saying earlier, he also knew there's now an atom typer for the Merck forcefield. PT: Actually it was him encouraging me to take the atom typing part of Open3DQSAR and release it as a stand-alone tool, which got the awkward name of sdf2xyz and so on.
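The "call the bundled binary instead of linking" approach can be sketched like this. The per-platform directory layout and the program name `txyz_converter` are invented for illustration; the pattern is simply: pick the executable shipped for the current OS and run it as a subprocess:

```python
import platform
import subprocess
from pathlib import Path

def bundled_path(pkg_dir=Path("bundled_bin"), name="txyz_converter"):
    # pick the binary shipped for this OS; layout is an assumption
    sub = {"Linux": "linux", "Darwin": "macos", "Windows": "windows"}[platform.system()]
    return pkg_dir / sub / name

def run_tool(binary, args):
    # run the external program, capturing output so failures report cleanly
    result = subprocess.run([str(binary), *args], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout

# demonstration with a universally available command instead of the bundled tool
print(run_tool("echo", ["hello"]))
```

Because only the standard launch/exit-code interface is used, no library versions on the user's system can interfere, which is exactly the point Paolo makes.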
It was really him promoting the idea that I release this as a stand-alone tool. AD: Did you ever read The Hobbit? PT: Yes, I did. AD: "There and Back Again". PT: Well, I read it a long time ago, so I don't have very clear memories of it. AD: The subtitle of Bilbo Baggins's book was "There and Back Again", so "sdf2xyz2sdf" is "There and Back Again." PT: That's true. AD: So you've got the whole business of talking with Tinker during the conformational search ... doing the quenched MD to get all the different conformations and getting them back. You store them in some result that you use for the next stage. What's the intermediate conformer storage format? The reason I'm asking is that I know you can save an SD file, but then you're just saving a bunch of stuff over and over again. Back in my MD days we just saved the coordinates, so I was wondering, oh! should people start doing this for their conformers? PT: The point is that while the SDF format can actually store multiple conformations, the XYZ format can't, because each XYZ file holds a single molecule. The way I tried to get around it, in the simplest way for all the users, is that you read the SDF file from standard input. In the SDF file you have a molecule name. If the molecule name is actually defined, you will obtain a properly named XYZ file for each molecule which was in the original SD file, and along with the XYZ file you will also obtain a key file with the keywords which are needed to accomplish the computations. As a default you get the basic set of keys, and then you can add, for instance, keys for implicit solvent models and whatever you want.
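Emitting the per-molecule XYZ file plus its companion key file could look roughly like this. The atom type numbers, the `SOLVATE GBSA` extra key, and the exact column widths are illustrative assumptions; the Tinker XYZ layout is "atom count and title, then index, element, coordinates, type, and bonded-atom indices per line":

```python
# Rough sketch of writing a Tinker-style .xyz plus a .key file for one
# molecule. Atom records: (element, x, y, z, tinker_type, bonded_indices).

def write_tinker_xyz(name, atoms):
    lines = [f"{len(atoms):6d}  {name}"]
    for i, (el, x, y, z, ttype, bonds) in enumerate(atoms, start=1):
        conn = "".join(f"{b:6d}" for b in bonds)
        lines.append(f"{i:6d}  {el:<3s}{x:12.6f}{y:12.6f}{z:12.6f}{ttype:6d}{conn}")
    return "\n".join(lines) + "\n"

def write_key(params="mmff.prm", extra=("SOLVATE GBSA",)):
    # default keys plus optional extras such as an implicit solvent model
    return "\n".join([f"PARAMETERS {params}", *extra]) + "\n"

water = [("O",  0.00, 0.00, 0.0, 186, [2, 3]),   # type numbers are placeholders
         ("H",  0.96, 0.00, 0.0, 187, [1]),
         ("H", -0.24, 0.93, 0.0, 187, [1])]
print(write_tinker_xyz("water", water))
print(write_key())
```

One such pair of files per SD record is exactly the "properly named XYZ file plus key file" output Paolo describes.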
PT: Actually there was a limitation in the original Tinker implementation of the Merck forcefield, since it didn't work with charged heteroaromatic rings. I could overcome this limitation because you can apply an external set of Merck forcefield charges, and apparently OpenBabel can perfectly assign Merck forcefield charges where the original implementation inside Tinker couldn't. So it looks like mixing the Open3DQSAR stuff with the Tinker stuff did good to both pieces of software, which is a good point. AD: Okay, so you've generated all of the conformers ... PT: Once you have the XYZ files you can get back to an SDF file just by using the original SD file as a template; only the coordinates are updated in the original SD file. That's quite straightforward. AD: So the next step for your QSAR model is that you first generate pharmacophores. I saw that Silicos was saying that you use their "fay-ro?", "far-o?", "Pharao?" PT: "Pharao", I would say "Pharao", I think. I don't know if that's the correct pronunciation. AD: Using the Silicos Pharao tool; do you do most of the pharmacophore filtering with their tool? PT: At the beginning I thought I could use Pharao, not so much for the pharmacophore part; essentially it's an alignment tool. After testing Pharao with several data sets, especially the data sets from the Sutherland benchmark suite, I could see that some of the data sets were very well suited to being aligned with a pharmacophore based tool, but some of the others, especially those which have an ill-defined pharmacophore, are not fit for that, because the compounds in the data set each have a different collection of pharmacophore points, so they are not really amenable to deriving a common template from those pharmacophores. That's why I decided to enhance the alignment part with some tools coded on my own which are not pharmacophore based but atom based.
That is, first you find the pairs of atoms which are most amenable to being matched, and then you do the actual matching. That seems to enlarge the pool of molecules you can align to a large extent. AD: Alright, I'm lost. I thought this was part of the pharmacophore alignment you were doing as the precursor step for the grid alignment, but I don't understand the 3D-QSAR well enough. PT: Well, the first tool which was born was the 3D-QSAR part. That implies that you already have some way to make the alignment. AD: So we haven't gotten to the point where you're doing the 3D-QSAR. We're still getting to the point where you're making the grids. PT: Yeah. The reason I started from the 3D-QSAR tool and not from the alignment tool, which would have been the most logical way to go, is that I already had the alignments made, and I needed a tool to do the computation on a high-throughput basis; that is, without having to input each and every model through a graphical user interface, as I used to do when I was using tools like GOLPE for the chemometric analysis. PT: The tool was fine for me, because I had other ways to accomplish the alignment, but after the 20th email where people were asking "ahh, that's a nice tool, but how the hell am I going to align the compounds to do 3D-QSAR?", I thought the time was ripe to make an alignment tool. AD: So you started off with the 3D-QSAR tool, and from that you needed the alignment tool, and for that you needed the conformer generation tool. PT: Exactly. AD: That's a lot of work. PT: Yep, it was. AD: So the pharmacophore alignment is the precursor to your 3D-QSAR code. For that, are you using Pharao? PT: Yep. AD: And of course what's interesting to me is that Silicos came out last year saying "We're going to support OpenBabel. We'll release our stuff as open source. We'll make this available to the world."
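Once matched atom pairs are in hand, the superposition itself is a rigid-body least-squares problem. A minimal 2D version of that step (real tools work in 3D, typically via the Kabsch or quaternion method): center both point sets, solve for the rotation angle in closed form, and apply it:

```python
import math

def superpose_2d(mobile, target):
    # least-squares rigid superposition of matched 2D point pairs
    n = len(mobile)
    cmx = sum(p[0] for p in mobile) / n; cmy = sum(p[1] for p in mobile) / n
    ctx = sum(p[0] for p in target) / n; cty = sum(p[1] for p in target) / n
    num = den = 0.0
    for (x, y), (u, v) in zip(mobile, target):
        x, y, u, v = x - cmx, y - cmy, u - ctx, v - cty
        num += x * v - y * u          # cross terms
        den += x * u + y * v          # dot terms
    theta = math.atan2(num, den)      # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - cmx) - s * (y - cmy) + ctx,
             s * (x - cmx) + c * (y - cmy) + cty) for x, y in mobile]

target = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
rotated = [(0.0, 0.0), (0.0, 1.0), (-2.0, 0.0)]   # target rotated by 90 degrees
print(superpose_2d(rotated, target))               # recovers the target points
```

The hard part in practice, as Paolo says, is deciding which atoms to pair in the first place; the superposition afterwards is mechanical.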
So here's an example where you said "the source code's available, I can use it for my tool", and it probably saved you a lot of time too. PT: Yep. Actually, I think mine was one of the first applications using Pharao. I met Gert Thijs at a meeting in Goslar last year, and they were very happy to see that someone was using their tool just a few weeks, I think, after they released it as open source. They also put the PDF of my poster on their web site as an example application, so I was quite pleased with that. AD: How did you find out about their tool? Was it because they announced it on the OpenBabel list? PT: I saw the paper in the Journal of Molecular Graphics and Modelling. The way I found it was that I was scouring the web for a free, open source alignment tool, because I wanted to try a kind of brute-force approach to 3D-QSAR model building, and before spending a lot of time coding my own alignment tool I wanted a hint as to whether this could make sense or not. I needed a tool to test the whole procedure in a short time, without losing a lot of time on something that probably didn't have a chance of working. As soon as I saw that there was a chance this was going to work, I coded my own alignment tool, which was not intrinsically better than Pharao but probably more fit for 3D-QSAR, because the alignments that you need for 3D-QSAR must be particularly consistent: every small inconsistency in the alignment increases the background noise in your models and hides the signal. That's why you need quite a special, purpose-built alignment tool, I would call it, for 3D-QSAR. AD: So you use the pharmacophore alignment to do the 6-degree-of-freedom alignment of your molecule to the reference, then you take that, transform the original conformation you generated, and that's where you finally get to the point where you do the 3D-QSAR. PT: Yes.
AD: Okay, so how does that go? That's embedded inside a grid, you're working in grid space, and doing more work with forcefields there. Is that inside OpenBabel? PT: Well, that part of the calculation is extremely trivial, because what you compute is just the non-bonded part of the forcefield: the van der Waals (Lennard-Jones) potential and the Coulombic part of the potential. That's really an extremely easy part of the job, so you don't need any fancy molecular mechanics tool. You can code it in 10 minutes, I would say. AD: And you're only talking about a few dozen atoms, so the easy n-squared algorithm is not a problem. PT: No, definitely not. AD: Okay, so you generate the grid, you assign energy points, and then you do an overlay of the grids. This is the part where I'm not sure what goes on next. You subtract the energy values, or take the difference of the energy values, and then you ...? PT: No, you take the energy value grid for each and every molecule, and then you build a PLS model of all the grid values. So you have thousands, tens of thousands of energy values, and PLS is very good at picking out the ones which have variance across the data set and just throwing away all the points which are basically either zero or sitting at the cutoff value, because there is a strong steric clash, or because the point is inside the molecule, or because many molecules of the data set are quite similar in certain parts and dissimilar only in small moieties. PLS is good at extracting only the genuinely different information across your data set. AD: Now this is the point where you hit the patent problem. PT: Yep.
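The "10-minute" grid-field step Paolo describes is just a sum of Lennard-Jones and Coulomb terms between a probe and each molecule atom, evaluated at every grid point, with a cutoff clamping the huge values inside atoms. A sketch with illustrative parameters (the epsilon, sigma, probe charge, ~332 unit-conversion factor, and 30 kcal/mol cutoff are not the values any particular program uses):

```python
import math

CUTOFF = 30.0  # kcal/mol: clamp the huge energies at points inside atoms

def field_at(point, atoms, probe_q=1.0, eps=0.1, sigma=3.0):
    # atoms: iterable of (x, y, z, partial_charge)
    e_vdw = e_coul = 0.0
    for (x, y, z, q) in atoms:
        r = math.dist(point, (x, y, z))
        sr6 = (sigma / r) ** 6
        e_vdw += 4.0 * eps * (sr6 * sr6 - sr6)   # Lennard-Jones 12-6
        e_coul += 332.0 * probe_q * q / r        # Coulomb, ~kcal/mol units
    return min(e_vdw, CUTOFF), min(e_coul, CUTOFF)

# one carbon-like atom with a small charge, probed 4 angstroms away
print(field_at((4.0, 0.0, 0.0), [(0.0, 0.0, 0.0, 0.2)]))
```

One such pair of values per grid point, per molecule, forms the thousands-of-columns matrix that PLS then digests, keeping the columns that actually vary across the data set.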
I admit that I was very naive, but I am always naive in my life. I was not very much concerned about this point, because I thought "ahh, but I'm a good guy", I'm not doing this for - yeah, you're laughing - I was really convinced of that, because I was doing it on a non-profit basis, this was an academic tool, and of course you could not even think of it competing with a full-featured 3D modelling suite like Sybyl. I mean, that's a complete tool, with a graphical user interface which allows you to do every step, from conformational search to alignments to 3D-QSAR model building. My tool at the moment was just doing 3D-QSAR model building, so I didn't think it was a competitor. PT: I was very confident, when I wrote to Tripos, that they would just tell me "ahh, yeah, just go along; it's not a problem." Especially because the patent had already expired in the USA and was going to expire in Europe in one year anyway, so I didn't think it would make a difference. But I received - a very polite one, I admit - a "no", not a "yes". That forced me to adopt a very unpleasant geographic distribution policy, and also prevented me from distributing it on a real open source basis. Of course the source was available, it has always been available, but you could not modify or redistribute it, because you didn't have a license to do that. This is going to end starting from June 18th, 2011; that is, in 10 days it will be over, because the patent will have expired everywhere in the world. I will finally be able to release it under the GPLv3 without issues. AD: You're releasing this under GPLv3. Why v3 instead of v2? I know OpenBabel has the v2 license for historical reasons. PT: I admit that I'm not a big expert on licenses. I read both the GPLv2 and v3, and I admit that I could not catch the fine differences between them, so I thought, as any software guy would, that I would take the most recent one.
I admit that this is again probably very naive, but you know, I don't care very much about these things. AD: I've actually read and listened to various things about the GPLv3, and it sounds like a better license all around than v2. It has some pretty good advantages, thought through with lots of input for v3, that were not fully worked out in the early 1990s when v2 came out. AD: So one of the things you mention on your web page is that this is an automated tool. How do you automate it? Is there a scripting language, or commands, or what? PT: I think the main thrust behind the whole story was making a suite suited for high-throughput modelling; that is, unsupervised to the highest degree, and scriptable. I thought for quite a long time about whether it was convenient to make a pure command-line tool, but since the software has many, many options, it appeared to me that a command-line tool would just mean adding a plethora of options that no one would remember, and in the end it would not have been very usable as a command-line tool. So I decided instead to go for an input script with a very simple syntax - keyword = parameter - in a CHARMM-like fashion, I think that could be a good example, or AMBER-like fashion. Small, text-based input scripts, which are very amenable to being built in an automated way with awk or sed or whatever stream editing tools. I think that was the best way to go about it. AD: Because it's for high throughput, I think I saw something there about how it's parallelized. PT: It's parallelized using just pthreads - pthreads on the Unix platforms and native Windows threads on Windows. Actually, here I should mention that the software is of course available as source code, but it's also available as precompiled binary distributions for Linux, Windows, Mac OS X, Solaris and FreeBSD, so you don't have to care about ...
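An input script in that keyword = parameter style might look something like the following. All command and keyword names here are invented for illustration; the real Open3DQSAR commands are documented on its web site:

```text
# hypothetical Open3DQSAR-style input script: one command per line,
# options as keyword=parameter pairs
import type=SDF file=training_set.sdf
import type=DEPENDENT file=activities.txt
box step=1.0 outgap=5.0
calc_field type=VDW
calc_field type=ELE
pls components=5
predict file=test_set.sdf
```

Because each line is plain text in a fixed shape, generating hundreds of such scripts with awk or sed, as Paolo suggests, is trivial.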
AD: I was looking at your gallery pages, and I was surprised to see how many operating systems you actually test on. PT: Actually, that takes a long time. Each time you make just a small bug fix you have to rebuild all the versions. That's very time consuming, but I think it's essential, especially in the first stages of your software's life, that it's immediately available to users without their having to build it, especially when you have a number of different dependencies and can easily get lost gathering all of them if you don't have much experience building software from source. AD: I released a package a week, a week and a half ago now, and the first few comments I got back were "are there prebuilt packages for Windows; I'm not used to compiling". So I agree. AD: In addition to all these operating systems, I also saw that you support visualizing the results not only in PyMOL but in MOE, Sybyl, and Maestro. You do a lot of integration work there, and that takes a good amount of time. PT: As long as you are using coordinate files, any software suite can read a MOL or SDF file, so that's not a problem. The problem was importing the binary grids, that is, the isocontours that visually show where you should add, for instance, steric bulk or negative electrostatic potential in your structures. Those are all proprietary formats, so you have to output all the different formats for all the different software platforms you want to target. For PyMOL, the format it can read is the same as the InsightII format. Talking about MOE, Sybyl, or Schrodinger's Maestro, they all have different formats. I had to support all of them and test all of them. But again, I think it's essential that people who are used to certain tools don't have to switch just because your software cannot target the platform they are used to.
I think one should always have users in mind when releasing software, rather than just your own convenience. Testing different platforms and different formats is also a very good way of debugging nasty things that you can otherwise overlook. Testing a couple of compilers often shows you weaknesses in your code that you would otherwise miss, so it was not lost time, I think. AD: So how do you do the testing? Do you have, for instance, automated tests, or do you mostly run manual tests? PT: I made basically a script which touches more or less all the features of the software in order to be able to ... Still, I admit that sometimes, when adding some module to the software, I experience regressions in other parts that I forgot to test each and every time, because you would never imagine that touching a part you thought was miles away from another one would screw it up. That comes with experience. I don't have big development experience; I'm building it slowly. AD: How did you get into this? What you've told me suggests you have some development background, but you're doing chemistry. How did you learn software development? PT: I must say that I was born as an organic synthesis chemist. My career went from 1998 to 2006 as an organic chemist, and the only things I was using computers for were writing papers and sending emails. I had no kind of programming experience; I barely knew what a computer was. I didn't know about a world called Linux and all of this. I really started from scratch. I know that if some professional programmer were to look at my code, probably especially the oldest parts of it, well, he would have some comments about it, but it works. AD: When programmers look at other programmers' code they also have comments, so that's not saying much. PT: Still, it was a big effort, for instance, to parallelize things efficiently, because the first release I made of my software was really slow.
Before making the public release, just by parallelizing better, I was able to improve on the original performance something like 10-fold. I really came to understand how much room you have for improvement and optimization of the code. I have made benchmarks against GOLPE, which is the only chemometrics tool I had access to, and the performance core-to-core is the same; but the nice point is that GOLPE is not parallel while my software is, so if you use an 8-core machine you get 8-fold performance. AD: You actually get an 8-fold speedup? PT: Yep, because you don't have very much IO, so it really scales linearly. That's a very good point. I'm really planning to do a GPU port of some of the code, because it's extremely amenable to a high degree of parallelization, so I expect to get very good speedups when I do that, but I really haven't had the time. AD: Well, thank you very much for your time. It's been enjoyable talking with you, learning more about how the 3D-QSAR program you've written works, and hearing more about the background of just what goes into writing such a program. PT: Thank you. I hope that I have somehow raised curiosity in the audience, and that someone will actually try my tools and, especially, give me feedback about bugs, new features, and whatever else you can imagine. AD: So the name of the program and web site is? PT: The web sites are http://open3dqsar.org/ and http://open3dalign.org/ for the 3D-QSAR and alignment programs respectively, so it's quite easy to remember. *music* AD: Thank you for listening to Molecular Coding. This podcast and transcript are distributed under the Creative Commons Attribution-Share-Alike 3.0 Unported license. The theme music was composed and performed by Andreas Steffen.
In this episode, recorded March 8, 2011, I talk with Brian Cole and Imran Haque about their experience with GPU computing. Brian works at OpenEye, and their ROCS implementation is about 1,000 times faster using GPUs than CPUs. Imran Haque is at Stanford, where he implemented a GPU version of the LINGO algorithm and contributed to GPU Computing Gems: Emerald Edition. Transcript begins: A: Welcome to Molecular Coding: the podcast about the software side of cheminformatics. I started recording this series last November with two interviews at the German conference on cheminformatics in Goslar. Earlier this year I recorded two more interviews at OpenEye's CUP user conference in Santa Fe, New Mexico. It took me until now to start editing them and putting them online. For this podcast I talk with Imran Haque and Brian Cole about their experiences with GPU computing and their success at implementing the ROCS and Lingo algorithms on the GPU. *music* A: Welcome to my podcast, Molecular Coding. Came up with the name. Very happy with it. We're now at the OpenEye CUP conference - the 12th year of the CUP conferences - in Santa Fe, New Mexico. I'm here with Imran and Brian, and we're going to talk about GPUs in computational chemistry. So, if you two could introduce yourselves. Brian? B: Alright, so I'm Brian Cole. I work for OpenEye. I run the conference here. I: I'm Imran Haque, a grad student in Vijay Pande's lab at Stanford. I'm working there on computational techniques for drug discovery. A: So, I wanted to talk with both of you because you've both been doing work with GPUs, making some cheminformatics algorithms very fast. So, for instance, Brian, you've done work in making ROCS... I suppose you've done work in making ROCS fast. B: Right, correct. A: So, what is ROCS, first off? B: So, ROCS... literally, what it stands for is Rapid Overlay of Chemical Structures.
Basically, it's a 3D comparison method for taking two chemical structures and trying to overlay them, and what you're actually calculating is the volume of the overlap. You get a similarity measure out of it, called a Tanimoto. It could be applied to many, many different things, but the most traditional use is virtual screening, basically. A: Alright. B: I have an active and I want to find another active in my database that looks like it. So... A: And so, how fast is ROCS? B: So, for ROCS on a CPU, the typical number we like to use is about a thousand per second. It's a good way to think of it. A: And FastROCS? B: FastROCS, on like a desktop machine, can easily go a million a second. Over a million a second. A: So, a thousand times faster. B: A thousand times faster. A: What is it about the algorithm that makes it appropriate for GPUs? Why is this algorithm so much faster on a GPU? B: A lot of it comes down to, basically: one, GPUs excel at single precision, and the algorithm doesn't actually need more precision than that. A: Mm, hm. B: The other is that they excel at 3D operations, and it's a 3D method. And a lot of it actually gets down to what GPUs are good at, and that's hiding memory latency. The inner loop of ROCS has always been dominated by memory latency - basically, random hits into memory, if you can think of it that way. So on CPUs you would always see very high cache miss rates, but there was no way to get around that. GPUs offer a very nice way of basically flooding the thing with as many threads as possible to try to hide that memory latency. A: So, Imran, you've also been working on the FastROCS code. You also worked on other projects, too. I: Yeah, that's right. A: What else have you worked on? I: So, I haven't actually worked on FastROCS. I developed my own version of a...
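The similarity ROCS reports is a shape Tanimoto built from overlap volumes: T = V_AB / (V_AA + V_BB - V_AB). A toy version of that formula (real ROCS computes the volumes as 3D Gaussian overlaps; the interval example below just stands in for a volume):

```python
def shape_tanimoto(v_aa, v_bb, v_ab):
    # self-overlap volumes of A and B, plus their mutual overlap volume
    return v_ab / (v_aa + v_bb - v_ab)

# two unit-length 1D "molecules" overlapping by half their length
print(shape_tanimoto(1.0, 1.0, 0.5))  # → 0.333...
```

Identical, perfectly superposed shapes give V_AB = V_AA = V_BB, hence a Tanimoto of exactly 1.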
of a GPU-enabled, you know, shape overlay method, which we call "Paper." It works slightly differently from FastROCS. But some of the other things I've been working on are accelerated versions of 2D chemical searching. So, for example, there's an algorithm called Lingo, which compares molecules by looking at overlapping substrings in their SMILES strings and comparing the intersections and differences between those sets. A: So, when you start working on an algorithm like the Lingo algorithm, do you start with the algorithm as written in C code and just port that directly over, or how does the translation of an algorithm go? I: So, it depends a lot on the particular algorithm that you're using. For example, the algorithms used in shape overlay and in ROCS are much more straightforward, I think, to put on a GPU, because things like the memory latency and arithmetic intensity are really well suited to a GPU. On the other hand, the algorithm that's canonical for CPU Lingo is actually very poorly suited to the GPU, because it's extremely branch heavy, or it requires a very large jump table to implement a finite state automaton. A: That's the DFA solution for doing Lingo. I: Exactly. And so, in order to get Lingo to run well on the GPU, I actually had to go back to sort of the definition of Lingo and figure out a completely different algorithm to implement it, to help it run efficiently. B: That was something we had to do as well. Basically, from the ground up, we had to look at every piece of the algorithm. A: For doing FastROCS. B: For doing FastROCS. It wasn't just taking, you know, some old C code that we had lying around and copying and pasting it in. Literally every line of code was inspected and analyzed in terms of how it would run on the GPU.
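The core of the Lingo idea fits in a few lines: slice each SMILES string into overlapping fixed-length substrings ("lingos"; length 4 in the original papers) and compare the resulting multisets with a Tanimoto. This sketch omits the SMILES preprocessing (ring-number normalization and so on) that real implementations do first:

```python
from collections import Counter

def lingos(smiles, q=4):
    # multiset of all overlapping length-q substrings of the SMILES string
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))

def lingo_sim(a, b):
    la, lb = lingos(a), lingos(b)
    inter = sum((la & lb).values())   # multiset intersection (min counts)
    union = sum((la | lb).values())   # multiset union (max counts)
    return inter / union

print(lingo_sim("CCCCO", "CCCCO"))   # identical strings → 1.0
print(lingo_sim("CCCCO", "CCCCN"))   # shares 1 of 3 distinct lingos → 0.333...
```

Because comparing two molecules reduces to merging two small sorted count tables, this reformulation maps far better onto a GPU than the branch-heavy string automaton Imran mentions.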
So, a lot of it was basically just, uh... you know, going through every line of code and asking, "Okay, what would cause things like branch divergence?" Because actually, one of the tricky bits of ROCS is that it's an optimizer, and by optimizers' very nature, they are branchy. A: Mm, hm. B: You know, "Do I go left? Do I go right?" *laughs* A: So you solve this by... I mean, I know that in some cases when you have branching, you can, say, merge the branches so that arithmetically you're doing both branches but just using the proper result. How do you handle the branching problem? How do you solve it, or get around it? B: The trick is that you just try to keep your IF statements as small as possible. That's one way of thinking about it. A: So you cancel a lot of IF statements. B: Some of it's canceling out; some of it's just "okay, these IF statements are equivalent". But other parts of it are just like... so, for example, suppose you wanted to negate a vector, but only sometimes. You could say, "IF true, negate my entire vector", or you can always multiply the vector by a constant and just set that constant to minus one when needed. You're then switching on what that constant is in a much smaller IF statement. You can use lots of tricks like that to cut down on branch divergence. I: And actually, for really small statements like that, the GPUs have hardware support B: for... I: to... B: for predicating it... I: to deal with that. B: So if it's small enough it doesn't really affect anything... I: Right. There is no branch, even though it looks like there is one in the code. A: So, if I wanted to get started doing GPU programming, I understand I can use my laptop and just work with that, 'cause my laptop has a GPU in it... B: Mm, hm. I: Right. A: What do I do to get started? What software do I use?
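Brian's "multiply by a switched constant" trick, written out in scalar form for clarity (on a GPU the same transformation lets every thread execute identical instructions, which is roughly what the hardware predication he and Imran mention achieves for small IFs):

```python
def negate_if_branchy(v, flag):
    # divergent version: whole-vector branch
    if flag:
        return [-x for x in v]
    return v

def negate_if_branchless(v, flag):
    # fold the condition into arithmetic: flag in {0, 1} -> sign in {+1, -1}
    sign = 1.0 - 2.0 * flag
    return [sign * x for x in v]

print(negate_if_branchless([1.0, -2.0, 3.0], 1))  # → [-1.0, 2.0, -3.0]
```

Both versions compute the same result; the second just replaces control flow with a multiply, trading a possible branch divergence for one extra arithmetic operation per element.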
*laughing* B: This is where we're gonna differ. I: I don't think so, actually. So, I mean, I think for people who are starting out with GPU programming, it probably makes sense to look at OpenCL as the language that you're going to use. There are essentially two major competing standards in the market right now for doing GPU programming. One of them is CUDA, which is a proprietary language from NVIDIA; the other is OpenCL. In terms of the code that actually runs on the GPU, they're fairly similar C-like languages with basically the same programming model. The code that you have on the host side to support it looks a little bit different. I think the key difference is really that CUDA runs only on NVIDIA, whereas OpenCL is sort of an open vendor standard. But as a consequence, there are sometimes performance consequences and, you know, speed-of-feature-adoption consequences between the two languages, so... B: I thought he was gonna say "CUDA," so I'm gonna take the other route and say CUDA, actually. *laughing* I: Just to play devil's advocate, was it? B: 'Cause if it's a hobbyist kind of person, and if it's just, you know, you wanna experiment... A: On a Mac? B: On a Mac. And you're just a hobbyist and you just, you know, "I wanna do a little bit of GPU computing," it would actually be quicker and easier to get up and running with CUDA. A: Why's that? B: Uh... simple things like just setting your kernel arguments: in OpenCL that's an individual function call with an individual index for each one. I: Right. B: And it can be quite a nightmare to make sure, "okay, are all my arguments matched up?" CUDA takes care of that automatically by being a single-source compile. I: So, what makes CUDA really nice is that they actually have two different APIs to program with on the host side, one called the "Runtime API" and one called the "Driver API." And the Runtime API makes things really simple.
To call a function on the GPU, you basically add three angle brackets and it makes a function call; it just goes. Whereas CL is basically equivalent to the CUDA driver API, which is a much bigger pain to use. Now, as far as getting started with GPU programming on a Mac, I'd actually add a caveat, in that the newest MacBook Pros are using AMD GPUs, so CUDA won't run on them. But at least for the last-generation ones, where they were using NVIDIA, CUDA is still a good solution.
B: Okay, so with OpenCL, you could make it a lot easier by using some of the higher-level things that are out there, like PyOpenCL, or the C++ bindings to OpenCL. Those take care of a lot of the kluginess of trying to use the C API, which is actually the standard; the standard is defined as a C API. So if you use some of the higher-level stuff, like PyOpenCL, it takes care of the klugy argument-setting and things like that for you. Plus it has a lot of other really nice high-level features. So, yeah, you can use that.
I: Yeah, and I'll throw in a plug for PyOpenCL and PyCUDA: if you're trying to do GPU stuff from Python, they're actually really good.
B: They're actually done by the same person. [Laughing]
A: Okay, so now pretend I have done some GPU programming for a while, and I wanna make this work in my research group. How do I bring these in? Do I get a bunch of laptops with graphics cards, or what do I do?
B: It really depends on what you're trying to do. FastROCS, for example, was designed as a server. It's this large in-memory service that any client can hit. So it's more along the lines of buying one of these machines with as many GPUs as you can pack into it.
A: And it can cool off.
B: And can cool off, and a lot of other neat things like that. And it can handle the power draw. Some of these rack-mountable things require, like, 220-volt power, that kind of thing.
'Cause they just draw a lot of power. So, if you're going for that sort of service-type architecture, for FastROCS you can actually get a machine that's up and running at two million conformers a second for less than 20 grand.
A: Okay.
B: You go to, like, a Supermicro reseller-type company, and you get four Tesla GPUs in a box for about 15 grand. Really, it depends on how much money you wanna spend. Because if you don't wanna spend a lot of money, you can actually get up and going with gamer cards. Gamer cards are actually faster than the Tesla high-performance-computing cards...
I: With a couple of problems.
B: With a couple of problems, but they are faster.
A: So you just take a couple of gamer cards, plug them in...
B: That's where part of the problem comes in: "a couple of gamer cards." If you only want one card in your system, it's pretty easy to get a gamer card in there. It's when you're scaling to multiple cards that...
I: Well, that's not really that big of a deal. So, my...
B: If you can get a machine that can do it...
I: So, to go back to your original question, "how do I get people working with them?" I think getting a couple of laptops with GPUs is an alright solution when you start playing with it, but the performance characteristics on larger GPUs are often quite different. Especially because laptop GPU architectures tend to lag a generation or two behind; sometimes they're slightly different.
B: And they're deliberately low-powered.
I: Yes.
B: To conserve power.
A: They've never been good gaming machines.
I: Also, the memory subsystems on laptop GPUs usually perform far worse. Particularly for memory-bound things like FastROCS, that'll make a big difference.
A: So when you're working with GPUs, you can start off with a laptop, but you can't get good, conclusive results until you've actually tested it on a real machine?
I: I think so.
B: You could easily see... I just ran this the other week: FastROCS on somebody's Quadro card in the office, and it was getting like one-tenth the performance. But the Quadro card wasn't designed for that sort of thing; it was designed just for visualization.
I: Right, and a lot of the visualization cards have relatively low shader counts, so they're perfectly adequate for graphics, but they're not really designed for really high-end compute. For example, the first machine that I was programming GPU stuff on had a low-end, like $60 GPU. When it became clear there really was something there, we got some high-end GPUs and started pushing those. So, on what Brian was saying about gamer GPUs: the primary GPU desktop machine I use in the lab now actually has a pair of gamer GPUs, a GTX 480 and a GTX 260. And when we were buying the machine, we were trying to find some online vendor where you could configure it, worried about a power supply big enough to deal with two GPUs. So we compared Dell workstations and HP workstations, and in the end it turned out the cheapest solution was just an Alienware gaming desktop.
B: Any reason you mixed the 260 and the 480? They're different generations... was that on purpose?
I: No, it's because when we bought the machine, they were out of the 480, so I got one with a 260, and Vijay said he would buy me a 480 when it became available.
A: I don't know if you've looked at it... Amazon now offers GPU...
I: GPU nodes.
A: GPU nodes, yes. And it's $2.10 an hour for two NVIDIA Tesla Fermis. That's good, I assume?
I: I'm insulated from the costs, 'cause I work for an academic institution, so I don't really know.
A: 'Cause I'm thinking about the overhead if you're developing this for yourself, in a group: you've got "who maintains it?", "who has the specialized knowledge?" of what kind of power, CPU... all the stuff you just talked about to get this working.
And especially... when you do stuff with FastROCS, is that usually gonna be a few minutes of work, or a few hours of work, or what?
I: A typical query of the PubChem demo that we had just last night was something like 68 million conformers; that was 130 seconds to run.
B: But that depends on having it all loaded up.
I: So, that depends on having it all loaded up, and that's where it becomes tricky. That's why you wouldn't want to go for a cloud solution like Amazon with FastROCS, because load time is everything! The idea is, it runs as a service, continuously, and then you'd have to pay Amazon constantly. [Laughing]
A: I'm sure they'd love that!
B: For one-off screening it might make sense, but if you want to do large-scale clustering, where you're loading it once into memory and hitting it many, many times while you actually have a GPU going, then it might be viable.
I: Right, as long as you can amortize that load time.
B: As long as you can amortize that load time. One of the chief things you have to learn when you start to do GPU computing is that it's all about data movement: between your main memory and your GPU memory, and between your GPU memory and the local caches on the actual chips in the GPU. You're constantly thinking about that data movement, this hierarchy of data. That's the downside of the cloud: you add the network to that hierarchy.
I: And if you're working on really large data sets, which I think are a lot of the reason people are interested in GPUs, then you really do start to worry about moving from disk just to system memory. I think that's where the load time in FastROCS, as well as some of the load time in Paper, comes from. We use slightly different data formats, but I still have to read this giant HDF5 array from disk and load it in.
A: Of course, if you have the algorithm and you want to test it out on a real machine, you could go to Amazon and try it out there first.
I: Absolutely.
B: Oh yeah.
A: So how did you two ...
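The amortization argument the guests make is easy to put in numbers. A minimal sketch, using the 130-second query figure from the conversation and an assumed, purely illustrative load time:

```python
def effective_seconds_per_query(load_s, query_s, n_queries):
    # Spread the one-time dataset load across all queries served,
    # then add the per-query compute time.
    return load_s / n_queries + query_s

# Hypothetical: a 30-minute load into GPU/system memory, 130 s queries.
# Served once, the load dominates; served 100 times, it nearly vanishes.
one_off = effective_seconds_per_query(1800, 130, 1)      # 1930.0
amortized = effective_seconds_per_query(1800, 130, 100)  # 148.0
```

This is why a long-running in-memory service beats a spin-up-per-query cloud instance for this workload: the service pays the load cost once.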
these are the sorts of skills not typically taught in computational chemistry. How did you get into this field?
B: I got into this field through a friend, actually. Basically, he offered me an internship with Wyeth. I'd started doing Python programming in my teens, and he's like, "Oh, you know Python? Well, I've got a lot of Python that needs writing." I started that way.
A: Did you come in as a chemist?
B: I came in... actually, it was only one year into college; I hadn't even selected a major yet. Getting into this field actually kind of helped me select a major: "Oh, I do like computers and I like science, so okay, I'll do that." But for how to learn GPU computing on your own, the best I can say is "read everything." Don't go, "I'm programming in OpenCL, so I shouldn't read the CUDA documentation," because there's a lot you can glean from the CUDA world and apply to OpenCL, and vice versa.
A: I saw there's the book "GPU Gems."
B: [pointing to Imran] And he's an author of a chapter.
A: Congratulations.
I: They were calling it volume 1 and volume 2, and now I think it's called the Emerald Edition. There are a couple of books, one out now and one out in about six months, called "GPU Computing Gems." The first volume is on methods in scientific computing, and I think the second is on programming libraries and more systems-software support stuff. Basically, the books are a collection of case studies from different groups about applications they've managed to port successfully to the GPU, and the tricks they used to make them work. The chapter I wrote is about how we made Paper go fast for shape overlay, and how we made LINGO go fast by an algorithmic transformation, exploring the different optimizations you make along the way and what kind of performance impact they have.
A: Nice. Is that book out already, or is it coming soon?
I: The first one is supposedly out now, though we haven't gotten the copy we're supposed to get, so I'm not sure.
B: I haven't read it yet, since it isn't out, but do you actually walk through "okay, this was the first implementation, and we transformed it this way"?
I: In the chapter that I wrote... obviously there are several thousand lines of code, so it focuses on the main objective-function kernel: what can we do with this? It gives you a listing of the critical few lines and asks, "Okay, what can we do? Why is this slow?", walks through four different iterations, and shows that by the end we got something that performs 3x as fast. And then, what can we do with the other kernels on the GPU as far as minimizing the data movement that Brian was talking about? It turns out that because the data movement is so costly, there are actually a couple of steps of the algorithm that run really poorly on the GPU (they really underutilize it), but because you don't have to copy data back to the CPU, it's a huge win.
A: So I'll ask you the same question I asked Brian, which is: how did you get into this field? Did you come in from chemistry, and then start doing computational work, and then GPUs?
I: No, I actually came in from the electrical engineering and computer science side. I was an EE in undergrad, and I did some undergrad research with a parallel computing group. The way that came about was actually long before GPUs were flexibly programmable; this was, like, shader model 1.0 stuff. I talked to my advisor about doing some GPU computing project, and she said, "That's way too hard. Why don't you learn traditional supercomputers?" So I did that, and I did some EE stuff for a while. In undergrad I got interested in computational biology and that sort of thing, so when I came to grad school I told my current advisor, "You know, I'm interested in all these sorts of things." Basically, the way I got into GPU computing was that I was running ROCS:
"I want to do really big data sets, and it's not fast enough. I heard GPUs are cool. I wonder if I can do something with this." That's just how it came about. As far as how people can get into GPU computing, I think Brian is right: just read as many of the resources as you can. When both of us got started, a lot of them didn't exist, so it was a lot harder to figure out what you were doing. It took me like five or ten readings of the first few chapters of the CUDA programming guide before I understood the memory model it was talking about. Read the NVIDIA guide; I think the AMD docs have gotten a lot better as well. The CL spec is a spec, so it's sometimes a bit hard to understand; I think the respective vendor programming guides give you a better introduction. And just write code. The first version of Paper I wrote... I remember coming back to it a year afterwards and finding things that were horrifying, after what I'd learned about the GPU since. But you know, it still worked.
A: So what would be a good first project for someone in chemistry? ROCS, or shape overlay?
B: I like the Tanimoto.
I: Which Tanimoto?
B: The bit-fingerprint Tanimoto.
I: Yeah, the bit-vector Tanimoto is not a bad one to go for.
B: It's trivial to get your first version going, and you'll get a huge speedup. There's also this iterative thing you can do to really squeeze performance out of it as you go.
I: I agree. It's a simple enough calculation, and yet it encompasses a lot of the aspects of parallelism and bandwidth and things like that.
B: You'll learn a lot.
I: Yeah, it's self-contained.
A: You both know I have a blog where I write things every once in a while, and I have a few entries on the Tanimoto, so once I take my first step into GPU programming I'm going to try that one out, and then I'll see what you all did and see how much slower I am. [Laughing]
A: Alright, anything else to add? ...Sounds like a no.
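For reference, the bit-fingerprint Tanimoto the guests recommend as a first project is small enough to sketch in a few lines of Python. This is a plain CPU version with a software popcount; a GPU port would run many of these comparisons in parallel:

```python
def popcount(x):
    # Software population count: the number of set bits in an integer
    # fingerprint. GPUs and modern CPUs have an instruction for this.
    return bin(x).count("1")

def tanimoto(fp_a, fp_b):
    # Tanimoto similarity of two bit fingerprints:
    #   |A AND B| / |A OR B|
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0
```

For example, `tanimoto(0b1101, 0b1001)` shares two of three total set bits, giving 2/3. The appeal as a learning exercise is that the kernel is trivial, but making millions of comparisons per second fast is all about the bandwidth and parallelism issues discussed above.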
I: If you are doing the fingerprint Tanimoto, I think it's easier to go fast in CUDA, because it has a population-count intrinsic...
A: That's cheating!
I: ...so in OpenCL you'll have to write your own, but that's a cute problem on its own.
A: I've done that, yes.
A: Alright, so thank you both very much for talking about GPUs, and let's get back to the conference.
[music]
A: I'm Andrew Dalke. Thank you for listening to Molecular Coding, and thanks to Imran and Brian for being my guests. The recording took place on March 8th, 2011, during a break at CUP. The content of Molecular Coding and any accompanying show notes are licensed under a Creative Commons Attribution-No Derivative Works 3.0 US license. The theme music was composed and performed by Andreas Steffen.
Transcript ends. Thanks to my wife, Sara Marie Dalke, for transcribing this episode.